CHAPTER 1


INTRODUCTION 


1.1  Introduction 

The influence of very large scale integrated (VLSI) circuit technology on our society
during the past decade has been overwhelming, in application areas ranging from
consumer products to personal computers, business management, and defense electronics.
The functional capability of the modern integrated circuit (IC) has increased
exponentially in scope and complexity over the past two decades. This exponential
growth pattern in IC functions was first described by Gordon Moore [1], and
the projection he made based on this pattern is known as Moore's law.

The creation of large, complex electronic systems has grown beyond the capabilities
of many engineers without the aid of computers. Successful completion of large design
projects requires that computers be used in virtually all aspects of the design process.
This trend toward automation will accelerate as improved circuit fabrication technologies
permit higher levels of integration and as more powerful computers allow more
sophisticated tools. These tools must span the spectrum of the design process, including
partitioned design entry, logic synthesis, circuit design, circuit simulation and verification,
physical design, process simulation, and design for testability and manufacturability.
These tools are commonly termed computer-aided design (CAD) or electronic design
automation (EDA) programs. The evolution of integrated circuit development has become
heavily dependent on the development of CAD and EDA resources for design support.

In  this  thesis,  we  examine  two  problems  in  the  field  of  electronic  design  automation: 
(1)  gate  sizing  for  combinational  circuits  and  sequential  circuits,  and  (2)  timing  and  area 
optimization  for  a  compact  placement.  Before  we  go  into  details  of  our  work,  we  give  a 
brief  description  of  the  electronic  systems  design  process. 


1.2  The  Process  of  Electronic  System  Design 

A typical IC design process, shown in Figure 1.1, is composed of the following four
phases [2,3]: system design, logic design, circuit design, and physical design. They are
briefly described below.

1.2.1  System  design 

System design is the process of defining the circuit functionality and the input-output
behavior. A behavioral representation describes how a particular design should respond to
a given set of inputs. Behavior may be specified by Boolean equations, tables of input
and output values, or algorithms written in high-level computer languages or hardware
description languages (HDLs).

Figure 1.1 A typical IC design process.

As far as the physical aspect of the design is concerned, this level deals with
connecting the major subsystems and the communication interfaces with the external
world; global wiring strategies; selecting layers for carrying global control, data, and
power; placement of major subsystems; and routing strategies.

1.2.2  Logic  design 

Logic design is the process of transforming the register transfer level (RTL)
specification of a design into a netlist of logic gates such as NAND gates, NOR gates,
inverters, AOI gates, and latches. This process begins with logic descriptions given by the
RTL specification or generated by designers directly at the logic level, and optimizes the
network of gates required to implement the function specified by the logic descriptions.
The design of random logic has objectives such as:

•  minimize  overall  layout  area  of  the  fabricated  chip; 

•  minimize  critical  path  delay  time; 

•  maximize  testability  of  the  synthesized  logic. 

Generally,  a  logic  design  system  divides  the  design  problem  into  two  steps  [4]: 

•  A  technology-independent  step,  which  manipulates  general  Boolean  functions  to 
optimize  the  logic,  using  algebraic  and/or  Boolean  techniques. 

• A technology-mapping step, which translates the technology-independent description
derived in the first step to a set of logic gates that can be implemented in the
design method of choice (e.g., standard cells, gate arrays, field-programmable gate
arrays).

1.2.3  Circuit  design 

The  circuit  design  phase  concerns  the  electrical  laws  that  govern  the  detailed  behavior 
of  the  basic  circuit  elements  such  as  transistors,  resistors,  capacitors,  and  inductors.  It 
transforms  the  basic  logic  components  into  networks  of  transistors  and  interconnects. 

Delay, power consumption, charge sharing, and reliability are among the
major concerns in this phase.

1.2.4 Physical design


Physical design consists of transforming a circuit design description into a physical
representation that can be used to manufacture the specified electronic circuit. Once the
circuit description of a network is available, it can be converted into a layout. Behavioral
or structural representations from the previous phases are transformed into geometric
shapes that are used in the fabrication of the system. Placement and routing are the two
major tasks in this phase.

Placement is the task of placing modules adjacent to each other on a chip to minimize
area or delay. The placement procedure determines the locations of components within
the circuit being designed, subject to the constraints imposed by the designers and the
design rules imposed by the fabrication process and by physical principles.

Following  placement,  components  are  arranged  on  the  chip,  and  the  task  remains  to 
insert  the  electrical  connections  among  the  components  to  make  them  function  correctly. 
A  router  takes  a  module  placement  and  a  list  of  connections  and  connects  the  components 
with  wires. 

Often, an iterative process of placement and routing is used to optimize certain
objectives of the design, such as performance or layout area.


1.3  Design  Styles 

There are various chip design options that may be used to implement a system design,
such as sea-of-gates, gate-array, standard-cell, and full-custom design. These VLSI
design approaches require different trade-offs and impose different constraints on the chip
physical design in an attempt to make the design more manageable, while maintaining
sufficient design flexibility.

Figure 1.2 Engineering trade-offs among different design styles. [Figure not reproduced:
it compares gate-array, standard-cell, and full-custom design in terms of circuit density,
performance, circuit flexibility, design time, checking, design automation, and the ability
to accommodate (multiple-pass) design changes.]

The  trade-offs  among  these  different  approaches  are  illustrated  in  Figure  1.2  [5].  In 
the  following,  we  briefly  discuss  the  advantages  and  disadvantages  of  different  design 
styles,  namely,  gate-array,  standard-cell,  and  full-custom  designs. 

To develop a full-custom design, engineering groups are assembled to cover the wide
range of skills required to design the part virtually from scratch. These groups may
include experts in process engineering, device modeling, circuit design, physical design
layout, logic design, and system architecture design. The final design is optimized for the
best density and performance. However, the design turnaround time is usually long. In
addition, due to the large design effort required, a full-custom design is warranted only for
high-volume products, for which the initial engineering expense can be recovered over a
long, active product life.

In the gate-array design, several of the lithographic patterning levels are standardized,
except for the interconnections and via geometries. The devices used to implement circuit
designs are prefabricated on a chip, but are left unconnected after the initial processing
step. A circuit design/logic block is placed at a specified cell location by assigning the
appropriate pattern of wires to coordinates inside that cell area; these wires connect
the devices to form the selected logic gate. Different logic blocks are then connected to
implement the desired logic function. The disadvantage of gate arrays is that they are not
optimal for any task. There are usually blocks that are not used. Since block placement
is done in advance, interconnect routing can become complex and the resulting long wires
can slow down the circuit. Also, the design will not be compact since interblock spacing
is fixed to allow for worst-case routing needs. Another problem with the gate-array approach
is that the transistor patterns are predefined. Therefore, the transistors cannot be tuned
to the specified application. This leads to inferior performance compared to full-custom
design.

Between these two extremes lies the standard-cell approach, which strives for high
design system support for chip physical design and the capability to locally optimize
circuit designs using handcrafted cells and layouts. This approach involves the use of
a library of basic functional elements, each of which has been fully characterized. In
the standard-cell design approach, a division is made between the tasks of handcrafting
circuit designs and placing and wiring those circuit blocks together. This separation is
based on the assumption that the time-consuming task of handcrafting a custom layout
is best restricted to small circuit designs only. The initial circuit design and layout are
done once, and the resulting shapes are stored in a technology library for repeated use
across many designs.


1.4 Standard-cell Design

The standard-cell approach has the advantage of greatly simplifying the automated
synthesis process because it separates the synthesis system from the details of cell
layout. The cell library presents models for individual cells, which are useful for
performing circuit and timing analyses. With the aid of many advanced CAD tools, the
performance of a standard-cell designed circuit is fairly high, while the design turnaround
time is short. Therefore, this approach has become the mainstay in application-specific
integrated circuits (ASICs).

A crucial issue in the cell library approach is the size of the library. If the library is too
small, much time is spent in converting the logic into a format that can be supported by
the small library. On the other hand, if the library size is too large, the issues of database
maintenance, pattern matching, and searching become significant [6]. Moreover, the useful
life of a library is relatively short, as dictated by the lifetime of the technology in use [7].
For these reasons, cell libraries tend to remain relatively small in size.

The prevalent use of complex gates such as AOI or OAI further complicates the
library issue. As shown in Table 1.1 [8], as many as 3,503 different complex gates can be
configured for (s,p) = (4,4), where the gates are constrained to have at most s transistors
from output to ground and p transistors from output to power supply. This number
dramatically increases to 425,803 for (s,p) = (5,5) and 154,793,519 for (s,p) = (6,6).
It is apparent that a moderately sized library cannot support all of the possible circuit
configurations for complex gates.

Table 1.1 Number of (s,p)-gates.

  s \ p      1       2        3         4          5            6
    1        1       2        3         4          5            6
    2        2       7       18        42         90          186
    3        3      18       87       396       1677         6877
    4        4      42      396      3503      28435       222943
    5        5      90     1677     28435     425803      6084393
    6        6     186     6877    222943    6084393    154793519

The  other  problem  with  the  standard-cell  approach  is  that  even  if  the  individual 
cells  in  the  library  are  nearly  optimal  in  performance  and  in  terms  of  compactness  of  the 
layout,  the  whole  circuit  is  often  suboptimal  after  all  cells  are  put  together.  For  flexibility, 
many  standard-cell  libraries  contain  multiple  versions  of  some  cells  with  different  driving 
powers. 

1.5  Discrete  Gate-Sizing  Problem 

1.5.1  Optimization  for  combinational  circuits 

In general, circuit delay and circuit area are the primary concerns of any logic design
optimization. In many cases, a reduction in the number of stages (gates) between an input
and an output node can reduce the circuit area and delay. Such an optimization is usually
made during the logic synthesis stage. This reduction is not, however, guaranteed to
reduce the circuit delay. A standard-cell library typically contains several versions of any
given gate type. Cells of identical gate type differ from each other in attributes such as
driving capability, gate area, and input capacitive load. Because of these differences, the
selection of cell versions for each individual gate in the circuit has a profound impact on
the characteristics (i.e., delay, circuit area, and power consumption) of the whole circuit.
By means of gate sizing, a fixed-topology logic circuit can be significantly optimized. In
this thesis, we assume that a logic-level circuit description is provided, and the objective
is to perform gate-size selection in an optimal way. The logic synthesis stage is usually
performed before the technology mapping stage. Hence, we do not address logic synthesis
in this thesis.

Given a netlist of a logic circuit and a cell library, an automatic gate-size optimization
algorithm chooses, from the library, one version of a logic gate for each cell such that
either

(1) the total circuit delay is under a constraint and an objective function (such as
circuit area or power consumption) is minimized; or

(2) the total circuit area (or power consumption) is under a constraint and the circuit
delay is minimized.

The former is called the area (power) optimization problem while the latter is called
the timing optimization problem. In this thesis, we concentrate on the first problem;
specifically, we minimize total circuit area. It should be mentioned that the algorithm
presented in this thesis can be extended to the power optimization problem and to timing
optimization under an area or power constraint as well.
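
The area optimization problem can be made concrete with a small sketch. The code below exhaustively enumerates every choice of cell versions for a short chain of gates and keeps the smallest-area choice that meets a delay limit. The library values, the delay model (drive resistance times load capacitance), and all names are illustrative assumptions rather than anything taken from this thesis, and exhaustive search is feasible only for toy circuits; it is not the algorithm developed in later chapters.

```python
from itertools import product

# Hypothetical library: each version is (area, drive_resistance, input_capacitance).
LIBRARY = [
    (1.0, 4.0, 1.0),  # small version
    (2.0, 2.0, 2.0),  # medium version
    (4.0, 1.0, 4.0),  # large version
]


def chain_delay(versions, load_cap):
    """Delay of a gate chain; each stage contributes R_drive * C_load (assumed model)."""
    total = 0.0
    for k, (_, r_drive, _) in enumerate(versions):
        # The load is the next gate's input capacitance, or the final output load.
        c_load = versions[k + 1][2] if k + 1 < len(versions) else load_cap
        total += r_drive * c_load
    return total


def min_area_sizing(num_gates, load_cap, delay_limit):
    """Return (area, delay, version indices) of the smallest-area feasible sizing, or None."""
    best = None
    for choice in product(range(len(LIBRARY)), repeat=num_gates):
        versions = [LIBRARY[i] for i in choice]
        delay = chain_delay(versions, load_cap)
        if delay > delay_limit:
            continue  # violates the delay constraint
        area = sum(v[0] for v in versions)
        if best is None or area < best[0]:
            best = (area, delay, choice)
    return best
```

For a two-gate chain driving a load of 4.0 with a delay limit of 12.0, `min_area_sizing(2, 4.0, 12.0)` selects the medium version for both gates; the search space grows as the number of versions raised to the number of gates, which is why heuristic methods are needed for realistic circuits.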

1.6  Optimization  for  Sequential  Circuits 

Optimization  for  synchronous  sequential  circuits,  on  the  other  hand,  is  different  from 
combinational  circuit  optimization.  An  additional  degree  of  freedom  is  available  to  the 
designer  in  that  one  can  set  the  time  at  which  clock  signals  arrive  at  various  flip-flops 
(FFs)  in  the  circuit  by  controlling  interconnect  delays  in  the  clock  signal  distribution 
network.  With  such  adjustments,  it  is  possible  to  change  the  delay  specifications  for  the 
combinational  stages  of  a  synchronous  sequential  circuit  to  allow  for  better  sizing. 

After developing an optimization algorithm for combinational circuits in Chapter 2,
we present an optimization technique for synchronous sequential circuits in Chapter 3.
We examine the following problem: Given a clock period specification, how can the area
of a synchronous sequential circuit be minimized by appropriately selecting a size for
each gate in the circuit from a standard-cell library, and by adjusting the delays between
the central clock and individual flip-flops?

In general, given a combinational subcircuit that lies between two FFs i and j, with
clock arrival times s_i and s_j, respectively, we have the following relations:

$$ s_i + \mathit{Maxdelay}(i,j) + T_{\mathit{setup}} \le s_j + P \tag{1.1} $$

$$ s_i + \mathit{Mindelay}(i,j) \ge s_j + T_{\mathit{hold}} \tag{1.2} $$

where Maxdelay(i,j) and Mindelay(i,j) are, respectively, the maximum and the minimum
combinational delays between the two FFs, T_setup and T_hold are the setup and hold
times of the FFs, and P is the clock period. Fishburn [9]
studied the clock skew problem under the assumption of constant combinational gate
delays, and formulated the problem of finding the optimal clock period and the optimal
skews as a linear program (LP). The objective was to minimize P, with the constraints
given by the inequalities in (1.1) and (1.2) above. In real design situations, however, P
is dictated by system requirements, and the real problem is to reduce the circuit area.
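
For a fixed clock period P, relations (1.1) and (1.2) are difference constraints on the clock arrival times, so their feasibility can be checked without an LP solver. The sketch below is not Fishburn's linear program; it substitutes a Bellman-Ford relaxation on the constraint graph, which either returns one valid set of arrival times or reports that none exists. The function name, argument layout, and the use of a single setup and hold time for all FFs are assumptions made for illustration.

```python
def feasible_skews(num_ffs, paths, period, t_setup, t_hold):
    """Return clock arrival times satisfying (1.1) and (1.2), or None if none exist.

    paths: list of (i, j, max_delay, min_delay), one entry per block of
    combinational logic that starts at FF i and ends at FF j.
    """
    edges = []
    for i, j, max_delay, min_delay in paths:
        # (1.1) rewritten as s_i - s_j <= P - Maxdelay(i,j) - T_setup
        edges.append((j, i, period - max_delay - t_setup))
        # (1.2) rewritten as s_j - s_i <= Mindelay(i,j) - T_hold
        edges.append((i, j, min_delay - t_hold))

    # Starting every distance at 0 acts as a virtual source tied to each FF.
    dist = [0.0] * num_ffs
    for _ in range(num_ffs):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # all constraints hold with s = dist
    return None  # still relaxing after num_ffs passes: the constraints conflict
```

As a usage example, take two FFs in a loop with `paths = [(0, 1, 8, 2), (1, 0, 4, 1)]` and setup and hold times of 1. With zero skew the period must be at least 9, yet `feasible_skews(2, paths, 8, 1, 1)` finds arrival times that meet a period of 8, while a period of 7 is rejected because the hold constraints limit how much skew can be borrowed.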

We  first  consider  optimizing  circuits  of  moderate  size.  Then,  in  Chapter  4,  we  consider 
arbitrarily  large  synchronous  sequential  circuits  for  which  the  size  of  the  formulated 
optimization  problems  becomes  prohibitively  large,  and  present  a  partitioning  algorithm 
to  handle  such  circuits.  The  partitioning  algorithm  is  used  to  control  the  computational 
cost  of  the  optimization  problems.  After  the  partitioning  procedure,  we  can  apply  the 
optimization algorithm to each partitioned subcircuit individually.

1.7  Performance-driven  Placement 

To  ensure  that  high-quality  designs  are  produced,  a  CAD  system  must  take  two 
important  issues  into  consideration  while  designing  a  circuit: 

•  Layout  efficiency:  producing  a  compact  circuit  layout. 

•  Performance:  satisfying  the  timing  specifications  dictated  by  the  clocking  scheme. 

With the increasing drive for high-performance chips, timing-driven layout has
become more and more important.

Conventional (area-driven) placement tools try to place modules in a chip to
minimize the total wire length. However, as device geometries continue to shrink, interconnect
delays become increasingly significant. As a result, the reduction of interconnect wire
length, which heavily influences the interconnect delay, has become increasingly
important.
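
As a rough illustration of the wire-length objective, the sketch below estimates total wire length with the half-perimeter of each net's bounding box, a widely used placement estimate. The text does not say which wire-length measure the tools in question use, so this choice, along with the function names and the coordinate representation, is an assumption.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding box enclosing a net's pin coordinates."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wire_length(placement, nets):
    """Sum the half-perimeter estimate over all nets.

    placement: dict mapping cell name -> (x, y) location.
    nets: list of nets, each a list of the cell names it connects.
    """
    return sum(half_perimeter([placement[cell] for cell in net]) for net in nets)
```

For instance, with cells placed at `{"D": (0, 0), "L1": (3, 4), "L2": (1, 1)}` and nets `[["D", "L1"], ["D", "L1", "L2"]]`, both nets have a bounding box of width 3 and height 4, so the estimate is 14.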

Recently,  there  has  been  extensive  research  on  performance-driven  placement  [10-13]. 
Performance-driven  placement  techniques  can  be  broadly  divided  into  two  categories: 
net-oriented  and  path-oriented.  In  the  net-oriented  approach,  the  acceptable  delay  of 
each  gate  (cell)  is  calculated  and  translated  into  bounds  on  the  delay  associated  with 
each  net.  These  bounds  then  serve  as  constraints  during  the  subsequent  placement  step. 
In the path-oriented approach, timing analyses of critical paths are performed
dynamically during the placement step. All paths, or a subset of them, are taken into account
implicitly in the formulation.

Conventionally, gate sizing is performed after technology mapping and before the
physical placement step. A drawback of such an approach is that accurate interconnect
wire lengths are not available during the gate-sizing procedure. The gate size selected
optimally at that stage may no longer be optimal after the physical design stage, in
which large interconnect capacitances are introduced at the output of each gate. To deal
with this problem, an iterative procedure is usually followed. After global placement,
the capacitance associated with each net is extracted, and the gate-sizing procedure
is repeated. However, in such an iterative approach, the variation of net capacitance
between iterations may be large and cause large perturbations in the solutions. Thus, a
number of iterations may be required, making this approach quite expensive. It is
therefore desirable that gate sizing and placement be incorporated into a
single procedure.

Figure 1.3 Advantage of gate sizing together with placement.

As an illustration, consider the layout placement shown in Figure 1.3(a). Gate D fans
out to gates L1, L2, and L3. Assume that the delay of this circuit under such a layout
violates  timing  constraints  imposed  on  it.  Moreover,  D  and  Lz  lie  on  a  long  path  whose 
delay  exceeds  the  timing  constraint.  Conventional  performance-driven  placement  would 
move D, L1, L2, and L3 closer to each other to decrease the delay of gate D, as shown in
Figure 1.3(b). This may increase the wire lengths of other nets attached to cells D, L1, L2,
and L3. But if automatic gate sizing is incorporated with performance-driven placement,
a possible solution would be to replace D with a template with a higher driving capacity,
and L1 with one with a smaller loading capacitance with respect to D. As a result,
some of the cells could be moved to better locations, as shown in Figure 1.3(c). The
overall effect is a reduction of the long path delay, while the increase in area is kept to a
minimum.

In Chapter 5, we propose an algorithm that combines the gate-sizing problem and
performance-driven placement into one procedure. By considering these two problems
together,  the  value  of  interconnect  capacitance  is  known  during  the  selection  stage  of 
the  automatic  sizing  procedure.  Therefore,  optimal  gate  sizes  can  be  chosen  for  each 
gate  based  on  layout  information,  thus  reducing  the  number  of  iterations  required  in  the 
conventional  approach. 


1.8  Organization  of  the  Thesis 

Chapter  2  of  the  thesis  deals  with  discrete  gate  sizing  for  combinational  circuits.  In 
Chapter  3,  we  formulate  the  synchronous  sequential  circuit  area  optimization  problem 
and  present  the  algorithms  to  tackle  the  problem.  The  partitioning  algorithm  presented 
in  Chapter  4  allows  us  to  handle  large  circuits.  Chapter  5  of  the  thesis  discusses  a  novel 
approach  to  timing-driven  placement.  Finally,  concluding  remarks  are  made  in  Chapter  6. 
CHAPTER 2


DISCRETE GATE-SIZING PROBLEM


2.1  Introduction 

The delay of a MOS integrated circuit can be tuned by appropriately choosing the
sizes of transistors in the circuit. While a combinational MOS circuit in which all
transistors have the minimum size has the smallest possible area, its circuit delay may not
be acceptable. It is often possible to reduce the delay of such a circuit, at the expense of
increased area, by increasing the sizes of certain transistors in the circuit. The
optimization problem that deals with this area-delay trade-off is known as the sizing problem.
In general, the interaction between the size of a certain gate and the delay of the whole
circuit is very complicated. A larger cell usually has a larger driving capability and a
larger input capacitive load. Therefore, using a large template tends to speed up the gate
itself while slowing down the predecessor gates that drive it.

Example 2.1 Consider the chain of three CMOS inverters shown in Figure 2.1(a) [14].
Let the width of both the n-type and p-type transistors in gate 2 be W_2, and let D be the
total delay through the three gates. Consider the effect of increasing W_2 while keeping
the size of the transistors in gates 1 and 3 fixed. This causes the magnitude of the output
current of gate 2 to increase; thus the time required, d_2, for gate 2 to drive its output
signal will decrease monotonically (Figure 2.1(b)). However, increasing W_2 also increases
the capacitive load on the output of gate 1, thus slowing down the output transition of
the first gate. Beyond a certain point, say A, the total delay, D, starts to increase with
respect to W_2, which shows the nonmonotonicity of the delay-area relationship. □

Figure 2.1 (a) A chain of three inverters. (b) Effect of transistor sizes on delay for the
three-inverter chain.
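
The behavior in Example 2.1 can be reproduced with a deliberately simple model. The sketch below assumes that each stage's delay equals the capacitance it drives divided by the width of the driving gate, with input capacitance proportional to width and all constants set to 1. This toy model, the fixed output load, and the function names are assumptions for illustration only; they are not the delay model used in the thesis or in [14].

```python
import math


def chain_delays(w1, w2, w3, c_out=1.0):
    """Per-stage delays (d1, d2, d3) of a three-inverter chain under the toy model."""
    d1 = w2 / w1     # gate 1 drives the input capacitance of gate 2
    d2 = w3 / w2     # gate 2 drives the input capacitance of gate 3
    d3 = c_out / w3  # gate 3 drives a fixed output load
    return d1, d2, d3


def total_delay(w1, w2, w3, c_out=1.0):
    """Total delay D through the three gates."""
    return sum(chain_delays(w1, w2, w3, c_out))


def best_w2(w1, w3):
    """Width of gate 2 minimizing w2/w1 + w3/w2, found by setting the derivative to zero."""
    return math.sqrt(w1 * w3)
```

With gate widths of 1 and 16 for gates 1 and 3, the delay of gate 2 keeps falling as its width grows, but the total delay is lowest at a width of 4 and rises on either side: `total_delay(1, 2, 16)` and `total_delay(1, 8, 16)` both exceed `total_delay(1, 4, 16)`.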

The rationale for dealing with only combinational circuits in a world rampant with
sequential circuits is as follows. A typical MOS digital integrated circuit consists of
multiple stages of combinational logic blocks that lie between latches, clocked by system
clock signals. Delay reduction must ensure that the worst-case delays of the
combinational blocks are such that valid signals reach a latch in time for a transition in the signal
clocking the latch. In other words, the worst-case delay of each combinational stage must
be restricted to be below a certain specification.

The problem of continuous sizing, in which transistor sizes are allowed to vary
continuously between a minimum size and a maximum size, has been tackled by several
researchers [14-21]. The problem is most often posed as a nonlinear optimization
problem, with nonlinear programming techniques used to arrive at the solution. The solutions
found by these techniques are then rounded to the nearest integer. The continuous model
works well for sizing transistors in a full-custom layout, but does not work well for
designing with macrocells and standard cells, for which only a small number of choices are
available. Transistor sizing on a gate array, where transistor sizes have to be multiples
of the standard transistor, is also poorly handled by the continuous model.

A  related  problem  that  has  received  less  attention  is  that  of  discrete  or  library-specific 
sizing.  In  this  problem,  only  a  limited  number  of  size  choices  are  available  for  each  gate. 
This  corresponds  to  the  scenario  in  which  a  circuit  designer  is  permitted  to  choose  gate 
configurations  for  each  gate  type  from  within  a  standard-cell  library.  This  problem  is 
essentially a combinatorial optimization problem and has been shown to be NP-complete
[22].

In this chapter, we present a new algorithm for solving the gate-sizing problem for
combinational circuits that takes into consideration the variations of gate output
capacitance with gate resizing. There are three phases in our algorithm. In the first phase,
the gate-sizing problem is formulated as a linear program. The solution of this linear
program provides us with a set of gate sizes that does not necessarily belong to the set
of allowable sizes. Therefore, in the second phase, we move from the linear program
solution to a set of allowable gate sizes, using heuristic techniques. In the third phase,
we further fine-tune the solution to guarantee that the delay constraints are satisfied.
Finally, to illustrate the efficacy of our algorithm, we present a comparison of the results
of this technique with the solutions obtained by simulated annealing as well as by our
implementation of the algorithm in [23].

It is worth mentioning that rounding solutions of the linear program to the nearest
available sizes may not produce good solutions. In a tightly constrained problem,
rounding continuous sizes to the nearest discrete size may not even give a feasible solution.
The continuous model works well for transistor sizing only because the
performance measures are rather insensitive to small changes in transistor sizes, and the
steps between possible sizes are small compared to the sizes themselves. In our problem,
however, the change between each step is large. Consequently, the solution obtained by
rounding a linear program solution may violate timing constraints, or the objective value
may be much larger than that of the optimal solution. Therefore, a more sophisticated
algorithm is needed to handle the problem.
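
A small numeric sketch shows how nearest-size rounding can break a timing constraint when the allowable sizes are far apart. The single-gate delay model (delay inversely proportional to size), the two allowable sizes, and the delay limit below are invented for illustration and do not come from the thesis.

```python
ALLOWED_SIZES = [1.0, 4.0]  # hypothetical discrete sizes offered by a library
DELAY_LIMIT = 5.0           # hypothetical delay constraint
R_UNIT = 12.0               # hypothetical delay of a size-1 gate driving its load


def gate_delay(size):
    """Toy model: delay is inversely proportional to gate size."""
    return R_UNIT / size


def round_to_nearest(size, allowed):
    """Pick the allowable size closest to a continuous size."""
    return min(allowed, key=lambda s: abs(s - size))


def is_feasible(size):
    return gate_delay(size) <= DELAY_LIMIT


# In the continuous relaxation, the smallest size meeting the limit is 12/5 = 2.4.
continuous_size = R_UNIT / DELAY_LIMIT
rounded_size = round_to_nearest(continuous_size, ALLOWED_SIZES)
```

Here the continuous size of 2.4 is closer to 1.0 than to 4.0, so nearest rounding picks size 1.0, whose delay of 12.0 violates the limit, even though the feasible size 4.0 is available. A rounding step therefore has to consult the timing constraints rather than distance alone.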

This chapter is organized as follows. We briefly describe previous approaches to the
discrete gate-sizing problem in Section 2.2. Then we describe the linear programming
approach that we propose in Section 2.3, followed by two postprocessing phases described
in Sections 2.4 and 2.5. Experimental results are given in Section 2.6. Finally, we conclude
this chapter in Section 2.7.


2.2 Previous Work


Chan  [22]  proposed  a  solution  to  the  problem  that  was  based  on  a  branch-and-bound 
strategy.  The  algorithm  is  exact  for  Boolean  tree  networks.  For  general  networks  that 
are  not  tree-structured,  a  backtracking-based  algorithm  is  proposed  for  finding  a  feasible 
solution.  The  algorithm  for  solving  the  optimal  discrete  sizing  problem  on  a  Boolean 
tree  network  consists  of  two  phases.  In  the  first  phase,  timing  requirements  for  each 
vertex  in  the  network  are  generated  and  propagated  through  the  network.  All  of  the 
timing  requirements  at  the  fan-in  of  each  vertex  are  intersected  to  prune  infeasible  timing 
requirements  of  the  vertex’s  predecessors.  In  the  second  phase,  backward  substitution 
is  used  to  assign  optimal  sizes  to  each  vertex  to  minimize  the  total  cost.  For  general 
DAGs (directed acyclic graphs), a cloning procedure is used to convert the DAG into an
equivalent tree, whereby a vertex of fan-out m is implicitly duplicated m times, followed
by a reconciliation step in which a single size that satisfies the requirements on all of the
cloned  vertices  is  selected.  As  pointed  out  in  [24],  this  procedure  does  not  necessarily 
provide  the  optimal  solution  for  a  general  DAG;  moreover,  this  algorithm  is  of  exponential 
complexity  in  the  worst  case. 

The  approach  of  Lin  et  al.  [23]  uses  a  heuristic  algorithm  that  is  an  adaptation  of 
the  TILOS  algorithm  [15]  for  continuous  transistor  sizing,  with  further  refinements.  The 
approach  is  based  on  a  greedy  algorithm  that  uses  two  measures  known  as  sensitivity 
and  criticality  to  determine  which  cell  sizes  are  to  be  changed.  The  sensitivity  of  a 
cell indicates how much local delay per unit area can be decreased if we pick another
template for this specific cell, while criticality tells us whether a cell has to be replaced by
a larger template to fulfill the delay constraints of the circuit. A weighted sum of a cell's
sensitivity and criticality is used to guide the algorithm to select a certain number of gates
to be replaced with a different template. At the beginning of the algorithm, all cells in
the circuit are set to their minimum sizes. The algorithm consists of a series of iterations,
each iteration in turn having two phases. In the first phase (increasing phase), a quantum
number of cells are replaced with larger templates, such that the delay constraints can
be satisfied. After the timing constraints are satisfied, in the second phase (decreasing
phase), a quantum number of cells are replaced with templates with smaller cell areas
to reduce total circuit area. The value of quantum is determined empirically and is
reduced by one-half over each iteration. The iteration continues until quantum becomes
1 or no improvement is possible. However, while the TILOS algorithm is known to work
reasonably well for the continuous sizing case, the primary reason for its success is that
the change in the circuit in each iteration is very small. On the other hand, in the discrete
sizing case, any change must necessarily be a large jump, and a TILOS-like algorithm is
likely to give highly suboptimal results.

Another  algorithm  proposed  by  Li  et  al.  [24]  is  exact  for  series-parallel  circuits.  A 
simple  parallel  circuit  is  a  basic  circuit  that  is  comprised  of  several  chains  that  have  the 
same  first  and  last  module.  A  series-parallel  circuit  is  a  basic  circuit  recursively  defined 
as  [24]: 

•  A chain of basic modules is a series-parallel circuit.

•  A simple parallel circuit is a series-parallel circuit.

•  A circuit obtained from a series-parallel circuit C by replacing any interconnect of
   C by another series-parallel circuit is also a series-parallel circuit.

The  algorithm  uses  a  dynamic  programming  technique  to  find  solutions  for  a  chain  of 
modules.  For  a  simple  parallel  circuit,  a  number  of  transformations  are  repeated  to  obtain 
the  optimal  implementation.  Finally,  the  optimal  implementation  of  any  series-parallel 
circuit  is  obtained  by  repeatedly  using  the  chain  and  simple  parallel  circuit  transformation 
on subcircuits of the given series-parallel circuit. This work is extended to
nonseries-parallel circuits, whose structures are represented by general DAGs, and several
heuristic techniques are used in conjunction with the algorithm, but no guarantees on
optimality are made for such circuits. Moreover, their algorithm does not consider the
capacitances of fan-out modules (gates). Therefore the results may not be accurate since,
in reality, the gate delay is a function of the fan-out gate sizes as well.

Both of the above two approaches [23,24] are heuristics, and hence no concrete statements
can be made on how close their solutions are to the optimal solution. Moreover,
neither  work  shows  comparisons  with  a  technique  such  as  simulated  annealing  [25]  that 
is  known  to  give  optimal  or  near-optimal  solutions. 

The  algorithm  proposed  in  [26]  does  use  simulated  annealing;  however,  since  simulated 
annealing  is  computationally  expensive,  a  technique  for  variable  pruning  is  used  by  this 
algorithm  to  reduce  the  computational  complexity.  An  initial  configuration  is  obtained 
using an algorithm similar to TILOS [15]. The set of gates that are left at minimum size at
the end of this algorithm is eliminated from the parameter space, under the assumption
that these cells would not be sized in the final configuration. The sizes of the remaining
cells are determined using a simulated annealing algorithm. One argument against such
an algorithm is that it would have very large runtimes for tight timing specifications, in
which a large number of cells would be sized by the TILOS-like heuristic.


2.3  Problem  Formulation 

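For a combinational circuit, the discrete gate-sizing problem is formulated as

\[
\begin{aligned}
\text{minimize} \quad & Area \\
\text{subject to} \quad & Delay \le T_{spec}
\end{aligned}
\tag{2.1}
\]

Alternatively, we can formulate the following problem:

\[
\begin{aligned}
\text{minimize} \quad & Delay \\
\text{subject to} \quad & Area \le A_{spec}
\end{aligned}
\tag{2.2}
\]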

In  this  chapter,  we  concentrate  on  the  first  problem,  although  the  same  algorithm 
can  be  applied  to  the  second  problem  with  minor  changes. 

2.3.1  Formulation  of  delay  constraints 

The delay of a gate in a standard-cell library can be characterized by

\[
delay = R_{out} \times C_{out} + \tau = \frac{R_u}{w} \times C_{out} + \tau_1 \cdot w + \tau_2 \tag{2.3}
\]

where $R_{out}$ is the equivalent resistance of the gate, $C_{out}$ is the load capacitance of the gate,
$\tau$ is the intrinsic delay of the gate, $R_u$ represents the on-resistance of a unit transistor,
and $w_i$ is called the nominal gate size of $G_i$. Therefore, the size of each gate can be
parameterized by a number, $w$, referred to as the (nominal) gate size.

Figure 2.2 An example illustrating calculation of the output load capacitance of a gate.

The  output  load  capacitance  of  a  gate  can  be  calculated  by  summing  the  gate  terminal 
capacitances  of  its  fan-out  gates  and  interconnect  wiring  capacitance,  assuming  that 
layout  information  is  given.  For  the  time  being,  we  ignore  the  interconnect  capacitance. 
In  Chapter  5,  we  will  discuss  how  to  combine  layout  information  with  our  formulation 
to  obtain  more  accurate  results. 

Consider a gate $G_i$ which fans out to several gates including gate $G_j$, as shown in
Figure 2.2. The output node of logic gate $G_i$ is connected to the n-type transistor $n_{ji}$ and
the p-type transistor $p_{ji}$ of logic gate $G_j$. Let the transistor $n_{ji}$ have the geometry shown
in Figure 2.3.

Figure 2.3 Top view of the geometry of a typical transistor.

As illustrated in Figure 2.3, the parameter $L$ stands for the length of the channel,
$xn_{ji}$ is the channel width of transistor $n_{ji}$, and the two remaining marked dimensions are
the lengths of the drain and source terminals. The gate terminal capacitance, $C_g$, of the
transistor $n_{ji}$ can be expressed as

\[
C_g = C_{gta} \cdot L \cdot xn_{ji} + 2 \cdot C_{gtp} \cdot (L + xn_{ji}) \tag{2.4}
\]

where

$C_{gta}$ : Gate terminal area capacitance (pF/μm²)

$C_{gtp}$ : Gate terminal perimeter capacitance (pF/μm)

Since the channel length, $L$, of transistors in a typical standard-cell library is fixed,
the output load capacitance of logic gate i with respect to logic gate j can be expressed
as

\[
cap(i,j) = K_1 \cdot xn_{ji} + K_2 \tag{2.5}
\]

where

\[
\begin{aligned}
K_1 &= C_{gta} \cdot L + 2 \cdot C_{gtp} \\
K_2 &= 2 \cdot C_{gtp} \cdot L
\end{aligned}
\tag{2.6}
\]
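As a quick check on the constants in (2.6), the affine form follows from expanding the gate terminal capacitance expression of (2.4) and collecting the terms in the channel width. The symbol $C_g$ for the left-hand side is an assumed name for the gate terminal capacitance.

```latex
\begin{aligned}
C_g &= C_{gta} \cdot L \cdot xn_{ji} + 2 \cdot C_{gtp} \cdot (L + xn_{ji}) \\
    &= (C_{gta} \cdot L + 2 \cdot C_{gtp}) \cdot xn_{ji} + 2 \cdot C_{gtp} \cdot L \\
    &= K_1 \cdot xn_{ji} + K_2
\end{aligned}
```

Because $L$ is fixed for the library, both $K_1$ and $K_2$ are constants, so the capacitance depends on the fan-out gate only through the channel width.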


Figure 2.4 Approximating gate terminal capacitance by an affine function.

In general, the gate terminal capacitances of a certain transistor in different versions
of a logic gate may not be linearly proportional to the nominal size of that logic gate.
For example, Figure 2.4 shows a typical plot of the gate terminal capacitance of a certain
transistor with respect to different sizes of a logic gate. In spite of this, we can
approximate the data points by an affine function using linear least-squares approximation,
as shown in the figure. In other words, the output load capacitance of logic gate i
as seen by logic gate j is

\[
cap(i,j) = \alpha_{ij} \cdot z_j + \beta_{ij} \tag{2.7}
\]

where $z_j$ is the size of logic gate j.
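The least-squares fit behind the affine approximation can be sketched as follows. This is a minimal illustration rather than the procedure used in the thesis: the function name `fit_affine` and the sample data are hypothetical, and the closed-form normal equations for a single-variable fit are assumed.

```python
def fit_affine(sizes, caps):
    """Least-squares fit caps ~= alpha * size + beta; returns (alpha, beta)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(caps) / n
    # Closed-form solution of the normal equations for one variable.
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, caps))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Hypothetical library data: nominal sizes and gate terminal capacitances.
sizes = [1.0, 2.0, 3.0, 4.0]
caps = [0.9, 2.1, 2.9, 4.1]
alpha, beta = fit_affine(sizes, caps)
print(alpha, beta)
```

The returned pair plays the role of the slope and intercept of the fitted line for one fan-in/fan-out gate pair.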

Therefore, the output load capacitance of gate i can be found to be

\begin{align}
C_{out} &= cap(i,1) + cap(i,2) + \cdots + cap(i,f) \notag \\
        &= \alpha_{i1} \cdot z_1 + \beta_{i1} + \alpha_{i2} \cdot z_2 + \beta_{i2} + \cdots + \alpha_{if} \cdot z_f + \beta_{if} \tag{2.8}
\end{align}

where $z_1, z_2, \ldots, z_f$ are the sizes of the cells which logic gate i fans out to.
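The delay model of (2.3), with the output load written as a sum of affine terms in the fan-out sizes, can be sketched numerically as below. The function names and all coefficient values are hypothetical and chosen only for illustration; interconnect capacitance is ignored, as in the text.

```python
def output_load(fanout_sizes, alpha, beta):
    """C_out as a sum of affine terms alpha_j * z_j + beta_j over the fan-out gates."""
    return sum(a * z + b for z, a, b in zip(fanout_sizes, alpha, beta))


def gate_delay(w, c_out, r_u, tau1, tau2):
    """Gate delay (R_u / w) * C_out + tau1 * w + tau2 for nominal size w."""
    return (r_u / w) * c_out + tau1 * w + tau2


# Hypothetical numbers: a gate driving two fan-out gates of sizes 1 and 2.
c_out = output_load([1.0, 2.0], alpha=[0.5, 0.5], beta=[0.1, 0.1])
d_small = gate_delay(1.0, c_out, r_u=2.0, tau1=0.05, tau2=0.2)
d_large = gate_delay(2.0, c_out, r_u=2.0, tau1=0.05, tau2=0.2)
print(c_out, d_small, d_large)
```

With these made-up coefficients, doubling the gate size halves the resistive term while only slightly increasing the intrinsic term, so the larger gate is faster for the same load.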

Thus, the delay function $D(w)$ of gate i with nominal size $w$ can be represented as

\[
D(w) = \frac{R_u}{w} \cdot C_{out} + \tau_1 \cdot w + \tau_2 \tag{2.9}
\]

Since $C_{out}$ is affine in the fan-out gate sizes, the delay of a cell is a sum of functions
of the form $g(w, z) = z/w$ and $h(w) = 1/w$. Figure 2.5 shows surface plots of the function
$z/w$. Since the function $g(w, z) = z/w$ is relatively smooth, it can be approximated by a
convex piecewise linear function with $q$ regions of the form

\begin{align}
PWL(w, z) &= \begin{cases}
a_1 \cdot w + b_1 \cdot z + c_1, & (w, z) \in \text{Region } R_1 \\
a_2 \cdot w + b_2 \cdot z + c_2, & (w, z) \in \text{Region } R_2 \\
\quad \vdots \\
a_q \cdot w + b_q \cdot z + c_q, & (w, z) \in \text{Region } R_q
\end{cases} \tag{2.10} \\
&= \max_{1 \le i \le q} (a_i \cdot w + b_i \cdot z + c_i) \quad \forall (w, z) \in \bigcup_{1 \le i \le q} R_i \tag{2.11}
\end{align}

The second equality follows from the first since $PWL(w, z)$ is convex.

Figure 2.6 Approximation of the function $1/w$ by a piecewise linear function.

The function $1/w$ is shown in Figure 2.6. Similarly, we can approximate the function
$h(w) = 1/w$ with a convex piecewise linear function of the form

\begin{align}
pwl(w) &= \begin{cases}
d_1 \cdot w + e_1, & w \in \text{Region } r_1 \\
d_2 \cdot w + e_2, & w \in \text{Region } r_2 \\
\quad \vdots \\
d_q \cdot w + e_q, & w \in \text{Region } r_q
\end{cases} \tag{2.12} \\
&= \max_{1 \le i \le q} (d_i \cdot w + e_i) \quad \forall w \in \bigcup_{1 \le i \le q} r_i \tag{2.13}
\end{align}
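One simple way to obtain a convex piecewise linear function of the max-of-affine form for $1/w$ is to take tangent lines at a few sample points. This is only a sketch under that assumption: the text does not state how the pieces are chosen, and the breakpoints below are hypothetical.

```python
def tangent_pieces(breakpoints):
    """Tangent lines of 1/w at each breakpoint w0: slope -1/w0^2, intercept 2/w0."""
    return [(-1.0 / (w0 * w0), 2.0 / w0) for w0 in breakpoints]


def pwl(w, pieces):
    """Convex piecewise linear value: the maximum over all affine pieces."""
    return max(d * w + e for d, e in pieces)


# Hypothetical breakpoints at gate sizes 1, 2 and 4.
pieces = tangent_pieces([1.0, 2.0, 4.0])
for w in (1.0, 2.0, 3.0, 4.0):
    print(w, pwl(w, pieces), 1.0 / w)
```

Because $1/w$ is convex for $w > 0$, every tangent lies on or below the curve, so this particular construction matches $1/w$ at the breakpoints and underestimates it in between. A least-squares or chord-based choice of pieces would trade that property for a smaller average error.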


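Therefore, the gate delay $D(w, z_1, \ldots, z_f)$ of a gate with size $w$ and fan-out gate sizes
$z_1, \ldots, z_f$ can be represented using a convex piecewise linear function with $q$ regions, as
follows:

\begin{align}
D(w, z_1, \ldots, z_f) &= \begin{cases}
a_1 \cdot w + b_{1,1} \cdot z_1 + \cdots + b_{1,f} \cdot z_f + c_1 + \tau_1 \cdot w + \tau_2, & (w, z_1, \ldots, z_f) \in \text{Region } R_1 \\
a_2 \cdot w + b_{2,1} \cdot z_1 + \cdots + b_{2,f} \cdot z_f + c_2 + \tau_1 \cdot w + \tau_2, & (w, z_1, \ldots, z_f) \in \text{Region } R_2 \\
\quad \vdots \\
a_q \cdot w + b_{q,1} \cdot z_1 + \cdots + b_{q,f} \cdot z_f + c_q + \tau_1 \cdot w + \tau_2, & (w, z_1, \ldots, z_f) \in \text{Region } R_q
\end{cases} \notag \\
&= \max_{1 \le i \le q} (a_i \cdot w + b_{i,1} \cdot z_1 + \cdots + b_{i,f} \cdot z_f + c_i) + \tau_1 \cdot w + \tau_2
\quad \forall (w, z_1, \ldots, z_f) \in \bigcup_{1 \le i \le q} R_i \tag{2.14}
\end{align}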


2.3.2  Formulation  of  the  linear  program 

The  formal  definition  of  the  gate-sizing  problem  for  a  combinational  circuit  is  as 
given  in  (2.1).  Since  the  objective  function,  namely,  the  area  of  the  circuit,  is  difficult  to 
estimate,  we  approximate  it  as  the  sum  of  the  gate  sizes,  as  has  been  done  in  almost  all 
work  on  sizing  [14-21]. 

Similarly, in general, the cell area of a logic gate may not be linearly proportional to
the cell size, as shown in Figure 2.7. Nevertheless, we can approximate those data points
by an affine function using linear least-squares approximation, as shown in the figure.

Therefore, the cell area of a gate i can be expressed as

\[
area(i) = \gamma_i \cdot w_i + \delta_i \tag{2.15}
\]

where $w_i$ is the nominal size of gate i, and $\gamma_i$ and $\delta_i$ are the fitted coefficients.

Figure 2.7 Approximating gate area by an affine function.

The delay specification states that all path delays must be bounded by $T_{spec}$. Since the
number of PI-PO paths could be exponential, the set of constraining delay equations could
potentially be exponential in the number of gates, unless certain additional variables, $m_i$,
$i = 1, \ldots, M$ (where $M$ is the number of gates), are introduced to reduce the number of
constraints. The worst-case signal arrival time $m_i$ corresponds to the worst-case delay
from the primary inputs to gate i. Using these variables, for each gate i with delay $d_i$,
we have

\[
m_i = \max \{ m_j + d_i \mid \forall\, j \in Fanin(i) \} \tag{2.16}
\]

where $Fanin(i)$ is the set of fan-in gates of gate i. Equivalently, we have

\[
m_j + d_i \le m_i \quad \forall\, j \in Fanin(i). \tag{2.17}
\]

This reduces the number of constraining equations to $\sum_{i=1}^{M} |Fanin(i)|$, which, for most
practical circuits, is of the order $O(M)$.

Figure 2.8 An example illustrating the definition of $m_i$.

For example, consider a part of a circuit as shown in Figure 2.8. Gates 1, 2, and 3
fan out to gate 4. The worst-case signal arrival times of gates 1, 2, and 3 are 4.5, 3.0, and
3.5, respectively. The gate delay of logic gate 4 is 0.3. Then the worst-case signal arrival
time of gate 4 is $m_4 = \max(4.5, 3.0, 3.5) + 0.3 = 4.8$.
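The arrival-time recurrence can be sketched as a single pass over the gates in topological order. The function name, the dictionary-based circuit representation, and the `preset` argument (used here to inject the arrival times of gates 1 to 3 from the example) are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def arrival_times(fanin, delay, preset=None):
    """Worst-case arrival time m_i = max over fan-in j of m_j, plus d_i.

    fanin maps each gate to the list of gates that drive it. Gates listed in
    preset keep their given arrival time; gates without fan-in start from 0.
    """
    m = dict(preset or {})
    # static_order() yields every gate after all of its fan-in gates.
    for g in TopologicalSorter(fanin).static_order():
        if g in m:
            continue
        preds = fanin.get(g, [])
        m[g] = max((m[j] for j in preds), default=0.0) + delay[g]
    return m


# The example above: gates 1, 2 and 3 fan out to gate 4.
m = arrival_times({4: [1, 2, 3]}, {4: 0.3}, preset={1: 4.5, 2: 3.0, 3: 3.5})
print(m[4])
```

Each gate contributes one comparison per fan-in edge, which mirrors the constraint count discussed above.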

We now formulate the linear program as

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{M} \gamma_i \cdot w_i \\
\text{subject to} \quad & \text{for all gates } i = 1, \ldots, M: \\
& m_j + d_i \le m_i \quad \forall\, j \in Fanin(i) \\
& m_i \le T_{spec} \quad \forall \text{ gate } i \text{ at POs} \\
& d_i \ge D(w_i, w_{i,1}, \ldots, w_{i,fo(i)}) \\
& w_i \ge Minsize(i) \\
& w_i \le Maxsize(i)
\end{aligned}
\tag{2.18}
\]

where $w_{i,1}, \ldots, w_{i,fo(i)}$ are the sizes of the gates to which gate i fans out, and $Minsize(i)$
and $Maxsize(i)$ are the minimum and maximum sizes of gate i in the library, respectively.
Notice that in the objective function, the constant term in (2.15) is omitted since it does
not affect the result.

The preceding is a linear program in the variables $w_i$, $d_i$, $m_i$. It is worth noting that
the entries in the constraint matrix are very sparse, which makes the problem amenable
to fast solution by sparse linear program approaches. Notice that the equalities of (2.14)
are replaced here by inequalities so as to satisfy (2.15). Because $D$ is a maximum of affine
functions, each constraint $d_i \ge D(\cdot)$ expands into $q$ linear inequalities, one per region.

It should be emphasized that our approach is able to handle different timing
specifications at different primary outputs. However, for the sake of simplicity, we use the
same timing specification for all of the primary outputs in the circuit.


2.4  Phase  II  :  The  Mapping  Algorithm 

The set of permissible sizes for gate i is $S_i$, where $p_i$ is the cardinality of $S_i$.
The solution of the linear program would, in general, provide a gate size, $w_i$, that
does not belong to $S_i$. If so, we consider the two permissible gate sizes that are closest to
$w_i$; we denote the nearest larger (smaller) size by $w_{i+}$ ($w_{i-}$). Note that in any standard
cell library, $w_{i+}$ has a smaller delay than $w_{i-}$. Since it is reasonable to assume that
the LP solution is close to the solution of the combinatorial problem, we formulate the
following smaller problem:

\[
\text{For all } i = 1, \ldots, M: \text{ select } w_i = w_{i+} \text{ or } w_{i-}, \quad \text{such that } Delay \le T_{spec}
\]

Although the complexity has been reduced from $O(\prod_{i=1}^{M} p_i)$ for the original problem
to $O(2^M)$, this is still an NP-complete problem. In this section we present an implicit
enumeration algorithm for mapping the gate sizes obtained using linear programming
onto permissible gate sizes. The algorithm is based on a breadth-first branch-and-bound
approach.
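Finding the two candidate sizes that bracket an LP solution can be sketched with a binary search over the sorted library sizes. The function name `bracket` and the library sizes in the usage line are hypothetical.

```python
from bisect import bisect_left


def bracket(lp_size, permissible):
    """Return (w_minus, w_plus): nearest permissible sizes below and above lp_size."""
    sizes = sorted(permissible)
    k = bisect_left(sizes, lp_size)
    if k < len(sizes) and sizes[k] == lp_size:
        # The LP size is already a permissible size, so there is only one choice.
        return lp_size, lp_size
    # Clamp at the ends in case lp_size lies outside the library range.
    w_minus = sizes[max(k - 1, 0)]
    w_plus = sizes[min(k, len(sizes) - 1)]
    return w_minus, w_plus


print(bracket(2.6, [1.0, 2.0, 4.0, 8.0]))
```

Restricting each gate to these two candidates is what shrinks the search space from the product of the library sizes to at most two choices per gate.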

It  is  worth  pointing  out  that  the  solution  to  this  problem  is  not  necessarily  the  optimal 
solution;  however,  it  is  very  likely  that  the  final  objective  function  value  for  a  solution 
arrived  at  using  good  heuristics  will  be  close  to  the  linear  program  solution,  and  hence 
close  to  the  optimal  solution.  This  supposition  is  borne  out  by  the  results  presented  in 
Section  2.6. 

In Section 2.4.1, we present an implicit enumeration mapping algorithm which is
single-path oriented. Although it executes quickly, the result may not be satisfactory.
Therefore, in Section 2.4.2, we propose an improved implicit enumeration mapping
algorithm using a global approach.

2.4.1  Implicit  enumeration  approach 

The algorithm first places all $M$ gates in a queue, $Q$, in decreasing order of their
worst-case signal arrival time, $m_i$. The longest path, $P$, from any PI to the gate at the
head of $Q$ is found. The unmapped gates along $P$ are mapped to permissible gate sizes
using an implicit enumeration approach [27]. Once a gate size has been mapped onto a
permissible size, it is said to be processed, and remains unchanged during the remainder
of the enumeration process. A processed gate is removed from the queue $Q$.

After $P$ has been processed, the process is repeated for the longest path to the gate
that is now at the head of $Q$, until $Q$ is empty. Thus, although the circuit could have
an exponentially large number of paths, our algorithm has to handle at most $M$ of those
paths, since each processed path removes at least one gate from $Q$.

Let $G_1$ be the gate that is currently at the head of the queue. Let $P = G_1, G_2, \ldots, G_{|P|}$
be the longest path from any PI to gate $G_1$, where $|P|$ is the number of gates on the
path. The order of gates on the path is such that $G_i$ fans out to $G_{i-1}$, $2 \le i \le |P|$. The
predecessor (successor) of gate $G_i$ on the path $P$ is the gate $G_{i+1}$ ($G_{i-1}$). Note that $G_{|P|}$
has no predecessor and $G_1$ has no successor.

Starting from $G_1$, we form a state-space tree. Each node at level i in the state-space
tree is a cell configuration, which represents a possible realization of gate $G_i$. To help
define a cell configuration, we introduce the following notation. Let

$C(i,j)$ : the jth node at level i,

$anc(i,j)$ : the ancestor node of $C(i,j)$,

$FO(i)$ : the set of gates that gate i fans out to,

$area(i, w_i)$ : the cell area of gate i when its size is $w_i$, $area(i, w_i) = \gamma_i \cdot w_i$ (see (2.18)),

$R^i_{out}(w_i)$ : the equivalent resistance of gate i, corresponding to size $w_i$,
that drives its load capacitances, $R^i_{out}(w_i) = R_u / w_i$ (see (2.3)),

$c^i(w_j)$ : $cap(i,j)$, given that gate j is the predecessor of gate i on path $P$,
and the size of gate j is $w_j$.

Definition 2.1 A cell configuration, $C(i,j)$, is a triple $(W_{ij}, A_{ij}, D_{ij})$,

\[
\begin{aligned}
W_{ij} &= W_{C(i,j)} \in \{ w_{i+}, w_{i-} \} \\
A_{ij} &= A_{C(i,j)} = area(i, W_{ij}) + A_{anc(i,j)} \\
D_{ij} &= D_{C(i,j)} = d_{ij} + D_{anc(i,j)}
\end{aligned}
\]

where $d_{ij} = R^i_{out}(W_{ij}) \cdot \sum_{k \in FO(i)} cap(i,k) + \tau_1 \cdot W_{ij} + \tau_2$,

where $A_{ij}$ is the accumulated area from the root to $C(i,j)$, $D_{ij}$ is the accumulated delay
from the root to $C(i,j)$, and $d_{ij}$ is the configuration delay associated with $C(i,j)$.
Physically, $d_{ij}$ corresponds to the delay of gate i, given that gate i has size $W_{ij}$, and gate
$(i-1)$ has size $W_{anc(i,j)}$.

In  the  state-space  tree,  each  node  has  no  more  than  two  successors  since  there  are  at 
most  two  choices  for  the  gate  size.  Every  node  in  the  tree  corresponds  to  an  assignment 
of  sizes  to  those  gates  which  lie  on  the  path  from  the  tree  root  to  that  node. 

The root of the tree is, by definition, assigned a null cell configuration (0,0,0). We
begin with the unprocessed gate on the current path, $P$, that is closest to the POs, and
implicitly enumerate the two possible realizations of each gate i, $w_{i+}$ and $w_{i-}$. The delay
of each gate is dependent on its own size and on the size of the gates that it fans out to.
Therefore, once $G_i$ has been enumerated, the delay associated with the predecessor of $G_i$
on path $P$ can be calculated, and it can be enumerated. The process continues until all
gates along $P$ have been processed.

During the enumeration process, it is possible to eliminate several of the possibilities
to prune the search space. A node $C(i,j)$ with a cell configuration $(W_{ij}, A_{ij}, D_{ij})$ is
bounded if there exists a cell configuration $(W_{ik}, A_{ik}, D_{ik})$ at the same level of the tree
such that

(1) $area(i, W_{ik}) \le area(i, W_{ij})$, $A_{ik} < A_{ij}$ and $D_{ik} \le D_{ij}$, or

(2) $area(i, W_{ik}) \le area(i, W_{ij})$, $A_{ik} \le A_{ij}$ and $D_{ik} < D_{ij}$.

Figure 2.9 An example illustrating the construction of a state-space tree in the mapping
algorithm.


Example 2.2 In Figure 2.9, let $G_1$ be the current head of the queue, $Q$. Let $G_2$ be the
predecessor of $G_1$, and $G_3$ that of $G_2$ on the longest path from a PI to $G_1$. There are
two possible realizations for $G_1$, namely,

(1) one with $area(1, W_{1,1}) = 1.2$ and delay $d_{1,1} = 0.9$, and

(2) one with $area(1, W_{1,2}) = 0.8$ and delay $d_{1,2} = 1.1$.

If neither node $C(1,1)$ nor $C(1,2)$ is bounded, we proceed to construct the second level
for both cell configurations. The two successors of node $C(1,1)$ in the tree represent two
possible configurations of $G_2$ if $G_1$ is chosen to be of the size with $area(1, W_{1,1}) = 1.2$.
Further, node $C(2,1)$ represents the configuration if $G_2$ is chosen to have a template
with $area(2, W_{2,1}) = 1.5$. Here, if the corresponding configuration delay of $G_2$ is $d_{2,1} = 0.8$,
then

•  accumulated delay of $G_1$ and $G_2$, $D_{2,1} = 1.7$

•  accumulated area of $G_1$ and $G_2$, $A_{2,1} = 2.7$

Similarly, node $C(2,2)$ represents the situation if $G_1$ is chosen to be of the size with cell
area 1.2 and $G_2$ with cell area 1.0. If the configuration delay of $G_2$ is $d_{2,2} = 1.2$, then

•  accumulated delay $D_{2,2} = 2.1$

•  accumulated area $A_{2,2} = 2.2$

The entries of nodes $C(2,3)$ and $C(2,4)$ can be calculated similarly.

Now, notice that nodes $C(2,1)$ and $C(2,3)$ have the same gate area for $G_2$, while node
$C(2,3)$ has less accumulated area and accumulated delay than node $C(2,1)$. Therefore,
node $C(2,1)$ is bounded, and it is not necessary to enumerate the descendants of $C(2,1)$.
Similarly, $C(2,2)$ is bounded since $C(2,4)$ has a superior configuration to $C(2,2)$. □
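The bounding test used in the example can be sketched as a dominance check between configurations at the same tree level. The sketch assumes the rule "no larger gate area, no worse accumulated area and delay, and strictly better in at least one of the two". The values for $C(2,1)$ and $C(2,2)$ come from the example, while those for $C(2,3)$ and $C(2,4)$ are hypothetical, since the text does not list them.

```python
from typing import NamedTuple


class Config(NamedTuple):
    gate_area: float  # cell area of the gate realized at this level
    acc_area: float   # accumulated area from the root
    acc_delay: float  # accumulated delay from the root


def dominates(c, other):
    """True if configuration c bounds configuration other."""
    if c.gate_area > other.gate_area:
        return False
    return ((c.acc_area < other.acc_area and c.acc_delay <= other.acc_delay)
            or (c.acc_area <= other.acc_area and c.acc_delay < other.acc_delay))


def prune(level):
    """Keep only the configurations that no other node at this level bounds."""
    return [c for c in level
            if not any(dominates(o, c) for o in level if o is not c)]


c21 = Config(1.5, 2.7, 1.7)  # from the example
c22 = Config(1.0, 2.2, 2.1)  # from the example
c23 = Config(1.5, 2.3, 1.6)  # hypothetical values
c24 = Config(1.0, 1.8, 2.0)  # hypothetical values
survivors = prune([c21, c22, c23, c24])
print(survivors)
```

With these numbers only the last two configurations survive, so the descendants of the first two are never enumerated.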


For every path $P$ in the circuit, we define a quantity known as the maximum path
delay (MPD), as follows:

\[
MPD(P) = \begin{cases}
\min_{j \in FO(i)} (m_j - d_j), & \text{if gate } i \text{ is not at a PO} \\
\min \{ T_{spec}, \min_{j \in FO(i)} (m_j - d_j) \}, & \text{if gate } i \text{ is at a PO}
\end{cases} \tag{2.19}
\]

where gate i is the gate that lies at the end of path $P$. Note that even if gate i is at a
PO, it could still fan out to other gates in the circuit; this is reflected in the definition of
the MPD. Maximum path delay physically corresponds to the maximal delay that can
be assigned to path $P$ before its effect is propagated beyond gate $G_i$ at the end of the
path.

After the state-space tree for the longest path $P$ has been constructed, the algorithm
examines the cell configurations at the leaf nodes of the tree. The cell configuration,
$C(|P|,n)$, which satisfies the following requirements, is selected.

(1) $D_{|P|,n} \le MPD(P)$,

(2) $D_{|P|,n} \ge D_{|P|,l}$ for all $C(|P|,l)$ with $D_{|P|,l} \le MPD(P)$.

In requirement (2), instead of using $A_{|P|,n} \le A_{|P|,l}$ as the criterion, we use $D_{|P|,n} \ge
D_{|P|,l}$. This is because we do not want to perturb the solution obtained from the linear
program too much. This way, it is expected that no change in gate size takes the
circuit delay radically away from $T_{spec}$.

By performing a trace-back from $C(|P|,n)$ to the root of the tree, the size of each
gate along $P$ is determined from the cell configurations at each traversed node of the
tree.
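The leaf selection and trace-back can be sketched as follows, assuming the selection rule is "the leaf with the largest accumulated delay that does not exceed MPD(P)". The `Node` class, its field names, and the numbers in the usage lines are hypothetical.

```python
class Node:
    """A state-space tree node holding the chosen size and accumulated values."""

    def __init__(self, size, acc_area, acc_delay, parent=None):
        self.size = size
        self.acc_area = acc_area
        self.acc_delay = acc_delay
        self.parent = parent


def select_leaf(leaves, mpd):
    """Pick the feasible leaf (acc_delay <= mpd) with the largest acc_delay."""
    feasible = [n for n in leaves if n.acc_delay <= mpd]
    if not feasible:
        return None
    return max(feasible, key=lambda n: n.acc_delay)


def trace_back(leaf):
    """Collect gate sizes from the leaf up to (but excluding) the null root."""
    sizes = []
    node = leaf
    while node is not None and node.parent is not None:
        sizes.append(node.size)
        node = node.parent
    sizes.reverse()  # first entry is the gate at level 1
    return sizes


root = Node(0.0, 0.0, 0.0)
n1 = Node(2.0, 1.2, 0.9, root)
leaf_a = Node(1.0, 2.2, 2.1, n1)
leaf_b = Node(4.0, 2.7, 1.7, n1)
chosen = select_leaf([leaf_a, leaf_b], mpd=2.0)
print(trace_back(chosen))
```

Preferring the slowest feasible leaf rather than the smallest one keeps the mapped path close to the delay budget the linear program assigned to it.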

2.4.2  Global  implicit  enumeration  approach 

The rationale behind our global enumeration algorithm is based on the following
observation. Given the solution of the linear program, the majority of the gates
remain at their smallest sizes. Only a small portion of the gates in the circuit are moved
to a larger size because, for a typical circuit, although there may be a huge number of
long paths, the number of gates on these long paths is, in general, relatively small.

Based on this observation, during the implicit enumeration procedure we may ignore
those gates which are assigned their smallest size by the solution of the linear
program, and concentrate on those gates that have been assigned larger sizes and
are probably on long paths.

Definition  2.2  A  critical  gate  is  a  gate  whose  size  is  larger  than  its  smallest  possible 
size. 

Notice  that  the  determination  of  critical  gates,  in  general,  can  be  very  difficult  to 
obtain,  since  whether  a  gate  is  critical  or  not  heavily  depends  on  the  circuit  structure  as 
well  as  the  tightness  of  the  delay  bounds.  However,  using  an  analytical  approach  such  as 
linear  programming,  whether  a  gate  is  critical  or  not  can  be  determined  easily. 

We modify the circuit topology by adding a source node $s_o$ and a sink node $s_i$. A
dummy edge is added from node $s_o$ to each of the input nodes and from each of the
output nodes to the node $s_i$. Next, for each gate i we define max-delay-to-sink, denoted
by $mds(i)$, to be the maximum of the delays of all possible paths starting from gate i to
the sink node $s_i$ [28]. That is,

\[
mds(i) = \max_{j \in FO(i)} \{ mds(j) + d_i \} \tag{2.20}
\]

The method for finding max-delay-to-sink is a topological sort. That is, $mds(i)$ of
a gate i can be calculated only after all of the $mds$ values of its fan-out gates have been
computed. Therefore, the computation of the $mds$ values starts from sink node $s_i$ and proceeds
backwards until we reach the source node $s_o$.
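The backward computation of max-delay-to-sink can be sketched as a pass over the gates in reverse topological order. The function name and the small circuit are hypothetical, and the sink node together with its dummy edges is assumed to contribute zero delay.

```python
from graphlib import TopologicalSorter


def max_delay_to_sink(fanout, delay):
    """mds(i) = max over fan-out j of mds(j), plus d_i.

    fanout maps each gate to the gates it drives. Gates with no fan-out feed
    the sink directly, which is assumed to contribute zero delay.
    """
    mds = {}
    # Passing the fan-out lists as dependencies makes static_order() yield
    # each gate only after all of its fan-out gates.
    for g in TopologicalSorter(fanout).static_order():
        succs = fanout.get(g, [])
        mds[g] = max((mds[j] for j in succs), default=0.0) + delay[g]
    return mds


fanout = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
delay = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 0.3}
mds = max_delay_to_sink(fanout, delay)
print(mds)
```

The value obtained for a gate fed by the primary inputs is the longest path delay through that gate, which is the quantity compared against the timing specification.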

A breadth-first search is applied to levelize the circuit from the sink node backwards.¹
The level of a gate i in this levelization is called its backward circuit level, $c\_level(i)$. By
definition, the backward circuit level of the sink node $s_i$ is 0, while the source node $s_o$
has the largest backward circuit level. Starting from $s_i$, we form a state-space tree by
implicitly enumerating critical gates. During the enumeration, noncritical gates remain
at their minimum size and need not be enumerated. Each level in the state-space tree
corresponds to a critical gate. The corresponding critical gate of level i is gate k, where
$k = \chi(i)$. Similarly, the corresponding level of a critical gate k in the state-space tree is
called the gate's tree level, $t\_level(k)$. Therefore $t\_level(\chi(i)) = i$. Each node at level i
in the state-space tree is a cell configuration, which represents a possible realization of
its corresponding gate. Let $C(i,j)$ denote the jth node at level i, and $anc(i,j)$ be its
ancestor node.

Definition 2.3 A cell configuration, $C(i,j)$, is a triple $(W_{ij}, A_{ij}, D_{ij})$,

$W_{ij} = W_{C(i,j)} \in \{ w_{\chi(i)+}, w_{\chi(i)-} \}$

$A_{ij} = A_{C(i,j)} = \gamma_{\chi(i)} \cdot W_{ij} + A_{anc(i,j)}$

$D_{ij} = D_{C(i,j)} = \max \{ mds(k) \}$, where k is a gate in the circuit (not necessarily a
critical gate), which satisfies $c\_level(k) = c\_level(\chi(i)) + 2$,

where $A_{ij}$ is the accumulated area from the root to $C(i,j)$. (Notice that $\gamma_{\chi(i)} \cdot W_{ij}$ is the
cell area of gate $\chi(i)$, given that its size is $W_{ij}$.)

¹ This is different from a traditional levelizing scheme, which starts from the source node
and proceeds forwards.

In  the  state-space  tree,  each  node  has  no  more  than  two  successors  since  there  are 
at  most  two  choices  for  the  gate  size.  The  root  of  the  tree  is,  by  definition,  assigned 
a  null  cell  configuration  (0, 0, 0).  We  begin  with  the  critical  gate  that  has  the  smallest 
backward  circuit  level  and  implicitly  enumerate  the  two  possible  realizations  of  each  gate 
^(i)f  wx(i)+  ^d  The  delay  of  each  gate  is  dependent  on  its  own  size  and  on  the 

size  of  the  gates  that  it  fans  out  to.  Therefore,  once  g^^i)  has  been  enumerated,  the  delay 
associated  with  the  predecessor  of  g^(i)  can  be  calculated,  and  the  remaining  critical  gates 
can  be  enumerated.  During  the  enumeration  process,  it  is  possible  to  eliminate  several 
of  the  possibilities  to  prune  the  search  space.  A  node  C(i,j)  with  a  cell  configuration 
(Wij,Aij,Dij)  is  bounded  if  there  exists  a  cell  configuration  (Wi*,  at  the  same 

level  of  the  tree  such  that 

(1)  Aik  <  Aij  and  A*  <  A,,  or 

(2)  Aik  <  Aij  and  A*  <  Aj- 

After  all  of  the  critical  gates  have  been  implicitly  enumerated,  we  keep  calculating 
max-delay-to-sink  for  each  remaining  gate.  However,  since  noncritical  gates  have  fixed 
sizes,  no  enumeration  is  necessary.  Rather,  we  simply  propagate  the  values  toward  the 
source  node.  For  each  leaf  node  of  the  state-space  tree,  the  max-delay-to-sink  of  the  source 
node corresponding to that leaf is calculated and denoted by $D_j$. The cell configuration
which has the largest $D_j$ and satisfies $D_j \le T_{spec}$ is selected. By performing a trace-back
from the selected leaf node to the root of the tree, the size of each critical gate is
determined from the cell configurations at each traversed node.

(If there is more than one critical gate which has the same backward circuit level, one of them is
randomly chosen.)
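The bounding test used during enumeration can be sketched in a few lines of Python. This is only an illustration, not the GALANT implementation: it assumes that the test is the usual dominance rule (a configuration is bounded when another configuration at the same tree level is no worse in both accumulated area and delay, and strictly better in at least one). The names `Config`, `is_bounded`, and `prune_level` are hypothetical.

```python
from typing import List, Tuple

# (gate size w, accumulated area A, delay D) of one state-space tree node.
# The tuple layout is an assumption made for this sketch.
Config = Tuple[float, float, float]


def is_bounded(cfg: Config, level: List[Config]) -> bool:
    """Return True if some configuration at the same level dominates cfg."""
    _, area, delay = cfg
    for _, a_k, d_k in level:
        # Condition (1): strictly smaller area and no larger delay.
        # Condition (2): no larger area and strictly smaller delay.
        if (a_k < area and d_k <= delay) or (a_k <= area and d_k < delay):
            return True
    return False


def prune_level(level: List[Config]) -> List[Config]:
    """Keep only the configurations that are not bounded."""
    return [cfg for cfg in level if not is_bounded(cfg, level)]


# Illustrative numbers only (not taken from the experiments).
level = [(1.0, 10.0, 5.0), (2.0, 12.0, 4.0), (3.0, 12.0, 6.0), (4.0, 9.0, 7.0)]
survivors = prune_level(level)
```

Here the third configuration is dropped because the first one has both a smaller area and a smaller delay; the other three trade area against delay and are all kept.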

2.5  Phase  III  :  The  Adjusting  Algorithm 

After the mapping phase, if the delay constraints cannot be satisfied, some of the gates
in the circuit must be fine-tuned. For each PO which violates the timing constraints, we
identify the longest path to that PO. For example, if gate $G_p$ at the PO has a worst-case
signal arrival time $m_p > T_{spec}$, we first find the longest path, $P$, to $G_p$. The path slack of
$P$ is defined as

$Pslack(P) = T_{spec} - m_p$ (2.21)

We then calculate the local delay difference for each of the gates along path $P$. Assume
that $G_{i-1}$, $G_i$, and $G_{i+1}$ are consecutive gates, in order of
precedence, on path $P$. The local delay and local delay difference associated with $G_i$ are
defined as

delay{Gi)  =  (2-22) 

^delayiGi)  =  •  G^  (2.23) 

where $R_i$ and $C_i$ are, respectively, the equivalent driving resistance of gate $i$ and the
capacitive load driven by gate $i$. Therefore, $\Delta delay(G_i)$ is the difference between the
original local delay of $G_i$ and the new local delay of $G_i$ after we replace it with a different
gate size that has different values of $R_i$ and $C_i$.

After calculating the local delay difference associated with each of the gates along
path $P$, we select the largest one, $\Delta delay(G_n)$, which satisfies

$\Delta delay(G_n) \le Pslack(P)$ (2.24)

and change the size of $G_n$ accordingly. If none of the local delay differences satisfies
(2.24), we select the most negative one and replace the gate with a new realization. This
process continues until the delay constraints are all satisfied. Also, notice that unlike in
the mapping algorithm, we do not restrict our choices to $w_{i+}$ and $w_{i-}$ at this phase.
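One step of the adjusting phase can be sketched as follows. The sketch assumes a sign convention that the text implies but does not spell out: a local delay difference is the new local delay minus the original one (so a speed-up is negative), and the path slack is negative when the PO violates its specification. The `Candidate` tuple and the function name `pick_replacement` are hypothetical.

```python
from typing import List, Optional, Tuple

# (gate name, replacement size label, local delay difference); assumed layout.
Candidate = Tuple[str, str, float]


def pick_replacement(cands: List[Candidate], pslack: float) -> Optional[Candidate]:
    """Choose the gate replacement for one pass over the longest path P."""
    if not cands:
        return None
    # Candidates whose delay difference satisfies inequality (2.24).
    fixing = [c for c in cands if c[2] <= pslack]
    if fixing:
        # The largest difference that still satisfies (2.24).
        return max(fixing, key=lambda c: c[2])
    # Otherwise take the most negative difference and iterate again.
    return min(cands, key=lambda c: c[2])


# Illustrative numbers only.
cands = [("G1", "size2", -0.4), ("G2", "size2", -1.2), ("G3", "size4", -2.5)]
choice_a = pick_replacement(cands, pslack=-1.0)
choice_b = pick_replacement(cands, pslack=-3.0)
```

With a slack of -1.0 the replacement at G2 is chosen, since it is the largest difference that still closes the violation; with a slack of -3.0 no single replacement suffices, so the most negative one (G3) is applied and the process repeats.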


2.6  Experimental  Results 

The preceding algorithms were implemented in a program GALANT (GAte sizing
using Linear programming ANd heurisTics) on a Sun Sparc10 station. The test circuits
include several ISCAS85 combinational benchmark circuits [29]. Each cell in the
standard-cell library has four different sizes of realization with different driving capabilities.

To demonstrate the efficacy of the approach, a simulated annealing algorithm and Lin’s
algorithm [23] were implemented for comparison. The parameters used in Lin’s algorithm
have been tuned to give the best overall results. The simulated annealing algorithm that
we have implemented is similar to that described in [26]. However, unlike in [26], all gate
sizes were allowed to change during the simulated annealing procedure. While the run
times for this procedure were extremely high, the solution obtained can reasonably be taken
to be close to optimal: although simulated annealing does not guarantee the global optimal
solution, a well-designed algorithm and a very slow annealing procedure can provide a
solution that is very close to the global optimum.

The results of our approach, in comparison with Lin’s algorithm and simulated annealing,
are shown in Table 2.1. The test circuits include five ISCAS85 benchmarks,
and vary in size from 160 gates (824 transistors) to 3512 gates (15,396 transistors). It
can be seen that the accuracy of the results of our approach ranges from being as good
as simulated annealing for c432 to a discrepancy of less than 2.0% in comparison with
simulated annealing. The average discrepancy is less than 1.0%, and the run times are
considerably smaller than those for simulated annealing.

Although Lin’s algorithm runs much faster than GALANT, it does not always provide
good results. For loose timing constraints, its solution is comparable to the result obtained
using GALANT. For somewhat tight specifications, however, its solution becomes
excessively pessimistic. For even tighter delay constraints, it cannot obtain a solution
at all. As mentioned previously, Lin’s algorithm essentially is an adaptation of the TILOS
algorithm [15] for continuous transistor sizing, with a few enhancements. While the
TILOS algorithm is known to work reasonably well for the continuous sizing case, the
primary reason for its success is that the change in the circuit in each iteration is very
small. However, in the discrete sizing case, any change must necessarily be a large jump,
and a TILOS-like algorithm is likely to give very suboptimal results.

Table 2.2 shows the amount of time taken by the mapping and adjusting algorithm in
comparison with the time required to solve the linear program, for some of the results in
Table 2.1. It is clear that for all circuits, the chief component (over 95%) of the runtime
was the linear programming algorithm; the heuristic was extremely fast in comparison.
The discrepancy between the sum of LP solution time and the time required for mapping
and adjusting in Table 2.2, and the total runtime in Table 2.1 is attributable to the
preprocessing step which performs miscellaneous administrative steps such as reading in
the circuit description and levelizing the circuit.


Table 2.1 Performance comparison of GALANT with Lin’s algorithm and simulated
annealing. ([?]: entry illegible in the source.)

                  Simulated Annealing       GALANT                             Lin’s Algorithm
Circuit  T_spec   Area (A_SA)  Run time     Area (A_G)  Run time    A_G/A_SA   Area (A_L)  Run time  A_L/A_SA
c432     [?]      2372         19min 53s    [?]         4.82s       [?]        [?]         0.10s     [?]
         [?]      2515         21min 17s    [?]         5.38s       1.000      [?]         0.15s     [?]
         [?]      2950         24min 27s    2983        7.72s       1.011      -           -         -
c1355    [?]      8276         3h 32min     8276        1min 13s    [?]        8536        [?]       [?]
         [?]      9258         3h 45min     9412        2min 14s    [?]        10319       [?]       [?]
         [?]      10224        4h 12min     10417       3min 32s    [?]        -           -         -
c2670    [?]      17623        5h 22min     17623       4min 12s    1.000      [?]         11.21s    [?]
         [?]      17772        5h 42min     17790       4min 30s    1.001      [?]         19.7s     [?]
         [?]      18929        8h 12min     19079       7min 8s     1.008      -           -         -
c5315    [?]      [?]          13h 46min    36954       11min 52s   1.001      [?]         2.20s     1.012
         [?]      [?]          14h [?]min   37457       17min 28s   1.001      [?]         4.32s     1.102
         [?]      38618        14h 43min    38863       19min 2s    1.006      -           -         -
c7552    [?]      [?]          22h 5min     50604       35min 49s   [?]        51100       9.54s     [?]
         [?]      [?]          23h 20min    51254       52min 27s   [?]        53772       34.57s    [?]
         16.0     52069        24h 5min     52563       1h 11min    [?]        -           -         -
Average Area Ratio                                                  1.0057                           [?]


Table 2.2 Execution times for the Linear Program and the Mapping and Adjusting
Algorithms.

Circuit  # of gates  T_spec  LP solution  Mapping and Adjusting
c432     160         12.0    6.98s        0.75s
c1355    546         14.0    1min 4s      7.33s
c2670    1193        14.0    6min 50s     13.68s
c5315    2307        17.0    18min 29s    32.51s
c7552    3512        16.0    1h 10min     1min 21s

A comparison of the run times for GALANT, Lin’s algorithm, and simulated annealing
on the circuit c432, for various timing specifications, is shown in Table 2.3 and is
plotted in Figure 2.10. It is clear that GALANT is orders of magnitude faster than simulated
annealing, with results of comparable quality. It can be seen that as the timing
specification becomes tighter, the area increases; the increase in area is very rapid for
tighter timing specifications. In all cases, the solution obtained by GALANT is very close
to the solution obtained by simulated annealing. In comparison with the results of Lin’s
algorithm, we find that GALANT provides results of substantially better quality, with
reasonable run times.

The runtime of GALANT is seen to go up as the timing specifications become tighter.
This can be ascribed to the fact that there are many more solutions of the linear program
that are close to the optimal solution, and hence the simplex procedure takes a longer
time. This is in contrast with the case for a loose timing specification, in which most
gates are at minimum size at the solution, and the vertices of the feasible region where
these gates are at nonminimum sizes are clearly suboptimal.


Table 2.3 Performance comparison of GALANT with Lin’s algorithm and simulated
annealing for c432. ([?]: entry illegible in the source.)

                  Simulated Annealing       GALANT                             Lin’s Algorithm
Circuit  T_spec   Area (A_SA)  Run time     Area (A_G)  Run time    A_G/A_SA   Area (A_L)  Run time  A_L/A_SA
c432     [?]      2331         24min 29s    2331        4.63s       [?]        2331        [?]       [?]
         [?]      2337         24min 28s    2337        4.66s       [?]        2337        [?]       [?]
         [?]      [?]          25min 11s    [?]         4.72s       [?]        2368        [?]       [?]
         [?]      2372         25min 40s    2376        4.81s       [?]        2376        0.17s     [?]
         15.5     2394         25min 45s    2402        5.02s       1.003      2450        0.16s     1.023
         15.0     2420         26min 28s    2424        4.96s       [?]        2465        0.13s     1.019
         14.5     2467         26min 17s    2467        5.02s       [?]        2719        0.16s     1.102
         14.0     2515         26min 32s    2515        5.39s       [?]        2749        0.23s     1.092
         13.5     2563         27min 47s    2563        5.68s       1.000      2929        0.17s     1.143
         13.0     2645         27min 57s    2658        6.05s       1.005      3024        0.15s     1.143
         12.5     2801         28min 33s    2801        7.24s       1.000      3332        0.26s     1.190
         12.0     2950         24min 27s    2983        7.72s       1.011      -           -         -
         11.5     3096         36min 4s     3139        8.40s       1.014      -           -         -
         11.0     3300         38min 28s    3315        13.35s      1.005      -           -         -
         10.5     3546         43min 5s     3583        10.95s      1.010      -           -         -
Average Area Ratio                                                  1.009                            1.206


Figure 2.10 Comparison of GALANT and Lin’s algorithm against simulated annealing
for c432. [Plot; axis label: Normalized Area.]


2.7 Conclusions


In this chapter, an efficient algorithm is presented to minimize the area taken by cells
in standard-cell designed combinational circuits under timing constraints. We present
a comparison of the results of our algorithm with the solutions obtained by our implementation
of Lin’s algorithm [23] and by simulated annealing. In [23], it was shown that
Lin’s algorithm is able to obtain better results than the technology mapping of MIS2 [8].
Although Lin’s algorithm is fast, its solution becomes excessively pessimistic for tight
delay constraints. For very tight timing constraints, it fails to obtain a solution at all.
Experimental results show that our approach can obtain a near-optimal solution (compared
to simulated annealing) in a reasonable amount of time, even for very tight delay
constraints. By adding additional linear programming constraints to account for short
path delay [30], and slightly modifying the mapping and adjusting algorithm, the same
approach can be used to tackle the double-sided delay constraints problem.

The major bottleneck of our approach is the time required to solve the linear program.
The linear program is solved using a package available in the
public domain [31], which is based on a sparse-matrix dual simplex linear program solver.
It is possible to reduce the CPU usage using vector processors; as pointed out in [31],
the CPU usage can be reduced by about 40% on an Alliant FX/8 machine. Although
the computational complexity of the simplex method can be exponential in the worst
case, it has been observed that for most practical problems, the complexity ranges from
$O((1/n + 1/(m-n+1))^{-1})$ to $O((1/n + 1/(m-n+1) - 1/m)^{-1})$ for $m$ inequality constraints
and $n$ variables [32]. Polynomial-time linear programming algorithms such as
Karmarkar’s algorithm [33] may also be employed; however, in practice, their average run
time has been found to be similar to that of the simplex algorithm.

Finally, to increase the accuracy of the results, instead of using the RC delay model,
one can use fast timing simulation to evaluate the delay of the circuit during implicit enumeration.
The run time will be greater. However, as we have mentioned, the number
of critical gates is likely to be relatively small. Therefore, the size of the state-space tree is
usually small. This means that the number of times that we have to perform fast timing
simulation would also be small.


CHAPTER 3


OPTIMIZATION FOR
SYNCHRONOUS SEQUENTIAL
CIRCUITS

3.1  Introduction 

The delay-area optimization problem for a combinational circuit is examined in Chapter 2.
Optimization for synchronous sequential circuits, on the other hand, is different.
An additional degree of freedom is available to the designer in that one can set the time
at which clock signals arrive at various flip-flops (FFs) in the circuit by controlling interconnect
delays in the clock signal distribution network. With such adjustments, it is
possible to change the delay specifications for the combinational stages of a synchronous
sequential circuit to allow for better sizing. This effect is even more important in the
standard-cell environment, where the granularity of available choices for gate sizes is
coarse, and the delay of an optimally sized combinational subcircuit may differ significantly
from its delay specification. However, consideration of clock skew in conjunction
with sizing increases the complexity of the problem tremendously, since it is no longer
possible to decouple the problem and solve it on one subcircuit at a time.

Figure 3.1 The advantages of nonzero clock skew.



Example  3.1  Consider  the  circuit  shown  in  Figure  3.1.  If  the  gates  in  Block  1  are  sized 
substantially,  while  those  in  Block  2  are  close  to  their  minimum  sizes,  then  by  allowing 
a  clock  skew  at  FF  B,  it  is  possible  to  increase  the  delay  specification  for  Block  1  and 
decrease  that  for  Block  2.  This  could  reduce  the  area  of  Block  1  greatly,  at  the  expense 
of  a  small  increase  in  the  area  of  Block  2.  □ 

Example 3.2 Consider the synchronous sequential circuit shown in Figure 3.2. In addition
to the possibility of adjusting clock skews at boundary latches as in Example 3.1,
we  can  adjust  clock  skews  at  internal  latches  as  well.  By  doing  so,  it  is  also  possible  to 
reduce  the  circuit  area  of  the  combinational  block.  □ 

In general, given a combinational circuit segment that lies between two flip-flops $i$
and $j$, if $s_i$ and $s_j$ are the clock arrival times at the two flip-flops, we have the following
relations:

$s_i + Maxdelay(i,j) + T_{setup} \le s_j + P$ (3.1)

$s_i + Mindelay(i,j) \ge s_j + T_{hold}$ (3.2)

where $Maxdelay(i,j)$ and $Mindelay(i,j)$ are, respectively, the maximum and the minimum
combinational delays between the two flip-flops, and $P$ is the clock period. Fishburn
[9] studied the clock skew problem, under the assumption that the delays of the
combinational segments are constant, and formulated the problem of finding the optimal
clock period and the optimal skews as a linear program. The objective was to minimize
$P$, with the constraints given by the inequalities in (3.1) and (3.2) above. In real design
situations, however, $P$ is dictated by system requirements, and the real problem is to
reduce the circuit area.

Figure 3.2 An example illustrating the definition of a synchronous block.

In this chapter, we examine the following problem: Given a clock period specification,
how can the area of a synchronous sequential circuit be minimized by appropriately
selecting a size for each gate in the circuit from a standard-cell library, and by adjusting
the delays between the central clock and individual flip-flops? For simplicity, the analysis
will use positive-edge-triggered D flip-flops. In the following, the terms flip-flop
(FF) and latch will be used interchangeably. We assume that all primary inputs (PIs) and
primary outputs (POs) are connected to FFs outside the system, and are clocked with
zero (or constant) skew.

We first present an algorithm for small synchronous sequential circuits and then show
how it can be extended to arbitrarily large circuits. The algorithm works in three phases
to solve the problem. In the first phase, the combined gate sizing and clock skew optimization
problem is formulated as an LP. The solution of this LP provides us with a
set of gate sizes that does not necessarily belong to the set of allowable sizes. Hence,
in the second phase, we move from the LP solution to a set of allowable gate sizes,
using heuristic techniques. At the end of the second phase, the set of allowable sizes
obtained may not satisfy (3.1) and (3.2) simultaneously. Hence, in the third phase, we
fine-tune the longest path to satisfy (3.1), and we satisfy the short-path constraints in (3.2)
by appropriately inserting delay buffers in the short path.

In Chapter 4, we consider arbitrarily large synchronous sequential circuits for which
the sizes of the formulated LPs are prohibitively large, and present a partitioning algorithm
to handle such circuits. The partitioning algorithm is used to control the computational
cost of the linear programs. After the partitioning procedure, we can apply the
optimization algorithm to each partitioned subcircuit.

This chapter is organized as follows. We briefly discuss previous work on clock skew
optimization in Section 3.2. In Section 3.3, we formulate the synchronous sequential circuit
area optimization problem. To reduce the number of constraints in our formulation,
we propose a pruning algorithm in Section 3.4. A buffer insertion algorithm is presented
in Section 3.5, which is used to satisfy short-path constraints without violating long-path
constraints. Experimental results are given in Section 3.6. Finally, Section 3.7 concludes
this chapter.


3.2  Previous  Work  on  Clock  Skew  Optimization 

Clock skew refers to the variations in the delays from the central clock source to the
individual flip-flops of the system. Synchronous circuit designers usually try to eliminate
it. This effort can involve equalization of wire length [34] or wire width
[35], symmetric design of the distribution network, and design guidelines to eliminate skew
due to process variations [36]. Clock skew can limit the clock speed of a synchronous
system or cause clocking hazards leading to malfunction at any clock rate.

In a synchronous sequential circuit, a data race due to clock skew can cause the system
to fail [37]. Consider a synchronous sequential digital system with flip-flops (FFs) as
shown in Figure 3.3. Let $s_i$ denote the individual delay between the central clock source
and flip-flop $FF_i$, and let $P$ be the clock period. Assume that there is a data path, with
delay $d_{ij}$, from the output of $FF_i$ to the input of $FF_j$ for a certain input combination to
the system. As illustrated in Figure 3.4, there are two clocking hazards that constrain
$s_i$, $s_j$, and $d_{ij}$:

Double Clocking : If $s_j > s_i + d_{ij}$, then when the positive clock edge arrives at $FF_i$,
the data race ahead through the path and destroy the data at the input to $FF_j$
before the clock arrives there. When the clock edge finally arrives at $FF_j$, the
wrong data are clocked through. Since the data pass through two FFs with one
clock edge, this has been called double-clocking.

Zero Clocking : This occurs when $s_i + d_{ij} > s_j + P$, i.e., the data reach $FF_j$ too late.
When the clock edge arrives at $FF_j$, the correct data are not ready yet. Since no
correct data are clocked in by a FF, this is called zero-clocking.

It is, therefore, desirable to keep the maximum (longest-path) delay small to maximize
the clock speed, while keeping the minimum (shortest-path) delay large enough to avoid
clock hazards.

Figure 3.3 A synchronous sequential system.

In [9], Fishburn developed a set of inequalities which indicates whether either of the
above hazards is present. In his model, each $FF_i$ receives the central clock signal delayed
by $s_i$ by the delay element imposed between it and the central clock. Further, in order
for a FF to operate correctly when the clock edge arrives at time $t$, it is assumed that the
correct input data must be present and stable during the time interval $(t - T_{setup}, t + T_{hold})$,
where $T_{setup}$ and $T_{hold}$ are the setup time and hold time of the FF, respectively. For all of
the FFs, the lower and upper bounds $MIN(i,j)$ and $MAX(i,j)$ (where $1 \le i, j \le \mathcal{L}$, and $\mathcal{L}$
is the total number of FFs in the circuit) are computed, which are the times required
for a signal edge to propagate from $FF_i$ to $FF_j$. Since it is possible that multiple paths
exist from $FF_i$ to $FF_j$, $MIN(i,j)$ and $MAX(i,j)$ must be computed as the minimum
and maximum of these path delays; if no such path exists, define $MIN(i,j) = \infty$ and
$MAX(i,j) = -\infty$.

Figure 3.4 Double-clocking and zero-clocking.
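The bounds MIN(i,j) and MAX(i,j) can be obtained with one pass over the combinational logic in topological order. The sketch below is an assumption-laden illustration rather than the procedure of [9]: the circuit is modelled as a directed acyclic graph whose edges carry delays, FF outputs and inputs are separate nodes, and the function name `min_max_delays` and the node labels are hypothetical.

```python
import math
from collections import defaultdict, deque
from typing import Dict, Hashable, List, Tuple


def min_max_delays(edges: List[Tuple[Hashable, Hashable, float]],
                   source: Hashable) -> Dict[Hashable, Tuple[float, float]]:
    """Shortest and longest path delay from `source` to every node of a DAG.

    Unreachable nodes get (inf, -inf), matching the convention
    MIN = infinity, MAX = -infinity when no path exists.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {source}
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))
    lo = {n: math.inf for n in nodes}
    hi = {n: -math.inf for n in nodes}
    lo[source] = hi[source] = 0.0
    # Kahn's algorithm: visit each node after all of its predecessors.
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        for v, d in succ[u]:
            if hi[u] > -math.inf:  # only propagate from reachable nodes
                lo[v] = min(lo[v], lo[u] + d)
                hi[v] = max(hi[v], hi[u] + d)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return {n: (lo[n], hi[n]) for n in nodes}


# Illustrative circuit: two paths from FF1 to FF2, none from FF1 to FF4.
edges = [("FF1.Q", "a", 1.0), ("FF1.Q", "b", 3.0), ("a", "b", 1.0),
         ("b", "FF2.D", 2.0), ("FF3.Q", "c", 1.0), ("c", "FF4.D", 1.0)]
delays = min_max_delays(edges, "FF1.Q")
```

For this circuit the two paths from FF1 to FF2 have delays 4 and 5, so the pair of bounds is (4, 5); FF4 is not reachable from FF1 and receives (inf, -inf).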

To avoid double-clocking between $FF_i$ and $FF_j$, the data edge generated at $FF_i$ by
a clock edge may not arrive at $FF_j$ earlier than $T_{hold}$ after the latest arrival of the same
clock edge at $FF_j$. The clock edge arrives at $FF_i$ at $s_i$, and the fastest propagation
from $FF_i$ to $FF_j$ is $MIN(i,j)$. The arrival time of the clock edge at $FF_j$ is $s_j$. Thus,
we have

$s_i + MIN(i,j) \ge s_j + T_{hold}$. (3.3)



Similarly, to avoid zero-clocking, the data generated at $FF_i$ by the clock edge must
arrive at $FF_j$ no later than $T_{setup}$ before the next clock edge arrives. The
slowest propagation time from $FF_i$ to $FF_j$ is $MAX(i,j)$. The clock period is $P$; thus,
the next clock edge arrives at $FF_j$ at $s_j + P$. Therefore,

$s_i + T_{setup} + MAX(i,j) \le s_j + P$. (3.4)

Inequalities (3.3) and (3.4) dictate the correct operation of a synchronous sequential
system.
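The two inequalities translate directly into a checker for a given set of skews. The following is a minimal sketch; the function name `clocking_hazards` and the matrix representation of MIN and MAX (with infinity and minus infinity for pairs that have no path) are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def clocking_hazards(s: List[float], MIN: List[List[float]], MAX: List[List[float]],
                     t_setup: float, t_hold: float,
                     period: float) -> List[Tuple[str, int, int]]:
    """List the FF pairs (i, j) that violate inequality (3.3) or (3.4)."""
    hazards = []
    n = len(s)
    for i in range(n):
        for j in range(n):
            # (3.3) violated: data arrive before the hold window closes.
            if s[i] + MIN[i][j] < s[j] + t_hold:
                hazards.append(("double-clocking", i, j))
            # (3.4) violated: data arrive after the setup window opens.
            if s[i] + t_setup + MAX[i][j] > s[j] + period:
                hazards.append(("zero-clocking", i, j))
    return hazards


INF = math.inf
# One path from FF 0 to FF 1 with delay between 1 and 8 (illustrative values).
MIN = [[INF, 1.0], [INF, INF]]
MAX = [[-INF, 8.0], [-INF, -INF]]
no_skew = clocking_hazards([0.0, 0.0], MIN, MAX, 1.0, 0.5, 10.0)
late_capture = clocking_hazards([0.0, 2.0], MIN, MAX, 1.0, 0.5, 10.0)
late_launch = clocking_hazards([2.0, 0.0], MIN, MAX, 1.0, 0.5, 10.0)
```

With zero skew the example is hazard-free. Delaying the clock at the receiving FF by 2 produces double-clocking, and delaying it at the sending FF by 2 produces zero-clocking.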

Two  different  optimization  problems  are  then  formulated  [9]  with  regard  to  clock  skew 
optimization.  They  are  discussed  briefly  in  the  following. 

3.2.1  Minimize  P  subject  to  clocking  constraints 

Assume that the values of $T_{setup}$, $T_{hold}$, and the maximum and minimum delays between
each pair of flip-flops ($MAX(i,j)$, $MIN(i,j)$) are constant, while the clock period
$P$ and the clock skews to individual flip-flops, $s_i$, are variable. To make the period $P$ as
short as possible while satisfying the system of inequalities (3.3) and (3.4), a linear
program can be formulated as follows:

minimize $P$

subject to $s_i - s_j \ge T_{hold} - MIN(i,j)$, $\forall\, i, j = 1, \ldots, \mathcal{L}$ (3.5)

$s_j - s_i + P \ge T_{setup} + MAX(i,j)$, $\forall\, i, j = 1, \ldots, \mathcal{L}$
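For small examples the minimum period can be found without an LP solver. The sketch below deliberately swaps in a different technique from the linear program of [9]: for a fixed $P$, the inequalities are difference constraints on the skews, so feasibility can be tested with Bellman-Ford negative-cycle detection, and the smallest feasible $P$ is then located by bisection. The function names, the `paths` dictionary, and the caller-supplied upper bound `p_hi` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Paths = Dict[Tuple[int, int], Tuple[float, float]]  # (i, j) -> (MIN, MAX)


def feasible_skews(n: int, paths: Paths, t_setup: float, t_hold: float,
                   period: float) -> Optional[List[float]]:
    """Return skews satisfying (3.3) and (3.4) for this period, or None."""
    edges = []
    for (i, j), (d_min, d_max) in paths.items():
        edges.append((i, j, d_min - t_hold))            # s_j - s_i <= MIN - T_hold
        edges.append((j, i, period - t_setup - d_max))  # s_i - s_j <= P - T_setup - MAX
    dist = [0.0] * n  # implicit virtual source at distance 0 from every FF
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            base = min(dist)  # shift so that all skews are non-negative delays
            return [d - base for d in dist]
    return None  # still relaxing after n passes: negative cycle, infeasible


def min_period(n: int, paths: Paths, t_setup: float, t_hold: float,
               p_hi: float, tol: float = 1e-6):
    """Bisect on P between 0 and p_hi; return (period, skews) or None."""
    if feasible_skews(n, paths, t_setup, t_hold, p_hi) is None:
        return None
    lo, hi = 0.0, p_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible_skews(n, paths, t_setup, t_hold, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible_skews(n, paths, t_setup, t_hold, hi)


# Two FFs feeding each other (illustrative delays).
paths = {(0, 1): (2.0, 6.0), (1, 0): (3.0, 5.0)}
result = min_period(2, paths, t_setup=1.0, t_hold=1.0, p_hi=100.0)
```

For this two-flip-flop loop the constraints give a minimum period of 6.5, reached when the clock at the second FF arrives 0.5 later than at the first, whereas zero skew would require a period of at least 7. Bisection relies on feasibility being monotone in $P$, which holds because a larger $P$ only relaxes the second set of constraints.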

3.2.2  Maximize  minimum  margin  for  error 

While manufacturing a circuit, it is inevitable that process variations will cause design
parameters, such as component values, to deviate from their nominal values. As a result,
the manufactured circuit may no longer meet some design specifications, such as the
requirements on the delay. In addition, a system on the verge of a clock hazard
might pass system diagnosis but malfunction at unpredictable times due to fluctuations
in ambient temperature or power supply voltage. One way to increase the reliability of the
system and prevent these problems is to provide a safety margin on the slack of
every constraint, i.e., on the amount by which the inequality is satisfied. This
converts the problem into a maximin problem. It is modeled by introducing a new
variable $M$, which is added to each of the main constraint inequalities so that when $M$
is maximized by the program, it will be the minimum slack over all the inequalities. In
this problem formulation, $M$ and $s_i$ are variables to be determined, while $P$ is specified
as a constant. The problem can be formulated as:

maximize $M$

subject to $s_i - s_j - M \ge T_{hold} - MIN(i,j)$, $\forall\, i, j = 1, \ldots, \mathcal{L}$ (3.6)

$s_j - s_i - M \ge T_{setup} + MAX(i,j) - P$, $\forall\, i, j = 1, \ldots, \mathcal{L}$
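A small worked instance may help to see how the margin interacts with the skew. The numbers are invented for illustration and do not come from [9]: take two flip-flops that feed each other, with $MIN(1,2) = 2$, $MAX(1,2) = 6$, $MIN(2,1) = 3$, $MAX(2,1) = 5$, $T_{setup} = T_{hold} = 1$, and $P = 10$, and write $x = s_1 - s_2$. Pairs without a path impose no constraint, since $MIN = \infty$ and $MAX = -\infty$ for them.

```latex
\begin{align*}
(i,j) = (1,2):\quad & x - M \ge T_{hold} - MIN(1,2) = -1,
                    & -x - M \ge T_{setup} + MAX(1,2) - P = -3, \\
(i,j) = (2,1):\quad & -x - M \ge T_{hold} - MIN(2,1) = -2,
                    & x - M \ge T_{setup} + MAX(2,1) - P = -4.
\end{align*}
% Collecting the lower and upper bounds on x:
\[
  \max(M - 1,\; M - 4) \;\le\; x \;\le\; \min(3 - M,\; 2 - M)
  \quad\Longleftrightarrow\quad
  M - 1 \le x \le 2 - M .
\]
% The interval is nonempty iff M <= 1.5, so
\[
  M^{*} = 1.5 \quad \text{at} \quad x = s_1 - s_2 = 0.5 .
\]
```

With zero skew ($x = 0$) the same four inequalities allow only $M = 1$, so in this instance a skew of 0.5 raises the guaranteed margin by half a time unit.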


3.3  Formulation  of  Constraints 

In Fishburn’s approach [9], it is assumed that circuit delays are fixed. In our problem,
since gate sizes are to be determined, individual gate delays, and therefore the total circuit
delay, are variables, while the clock period is a user-specified constant. Therefore, the
problem becomes much more complicated since the delays $MIN(i,j)$ and $MAX(i,j)$
between each pair of latches are now also variables.

Our problem requires us to represent path delay constraints between every pair of
FFs. This may be achieved by performing PERT [38] on the circuit once for each FF of
interest (e.g., $FF_i$): the arrival time at $FF_i$ is set to 0, and the arrival times at all other FFs
are set to $-\infty$ ($\infty$) when computing the longest (shortest) delay path
from $FF_i$ to all FFs [9]. Therefore, in
addition to the longest-path delay variables $m_k$, for the shortest-path delay we introduce
new variables, $p_k$, $k = 1, \ldots, \mathcal{N}$, which correspond to the shortest delay from the PIs (the
outputs of FFs are considered as pseudo-PIs) up to the output of $G_k$:

$p_j + d_k \ge p_k$, $\forall\, j \in Fanin(k)$. (3.7)

To represent path delays between every pair of FFs, we need intermediate variables $m_k^i$
($p_k^i$) to represent the longest (shortest) delay from $FF_i$ to gate $k$. The number of
constraints so introduced may be prohibitively large. An efficient procedure for intelligent
selection of the intermediate $m_k^i$ and $p_k^i$ variables, which reduces the number of additional
variables and constraints without making approximations, has been developed. Deferring
a discussion of this procedure to Section 3.4, we now formulate the linear program for
a general synchronous sequential circuit as

minimize $\sum_{k=1}^{\mathcal{N}} \gamma_k w_k$

subject to $d_k \ge D(w_k, \ldots, w_{k,fo(k)})$, $1 \le k \le \mathcal{N}$
           $w_k \ge Minsize(k)$, $1 \le k \le \mathcal{N}$
           $w_k \le Maxsize(k)$, $1 \le k \le \mathcal{N}$ (3.8)
           For all FF $i$, $1 \le i \le \mathcal{L}$:
             $s_i + p_k^i \ge s_j + T_{hold}$, $1 \le j \le \mathcal{L}$, $k = Fanin(FF_j)$
             $s_i + T_{setup} + m_k^i \le s_j + P_{spec}$, $1 \le j \le \mathcal{L}$, $k = Fanin(FF_j)$
             For all gates $k = 1, \ldots, \mathcal{N}$:
               $m_l^i + d_k \le m_k^i$, $\forall\, l \in Fanin(k)$
               $p_l^i + d_k \ge p_k^i$, $\forall\, l \in Fanin(k)$


The above is a linear program in the variables $w_i$, $d_i$, $m_i$, $p_i$, and $s_i$. Again, the entries
in the constraint matrix are very sparse, which makes the problem amenable to fast
solution by sparse linear program approaches.


3.4 Symbolic Propagation of Constraints

We begin by counting the number of LP constraints in (3.8). We ignore the constraints
on the maximum and minimum sizes of each gate since these are handled separately by
the simplex method. The $d_k$ inequalities impose $q$ constraints for each of the gates in the
circuit to the LP formulation (see Eq. (2.15)). Let $\mathcal{F} = \sum_{i=1}^{\mathcal{N}} |Fanin(i)|$, where $\mathcal{N}$ is the total
number of gates in the circuit. Then for each FF, there are $O(\mathcal{F} + \mathcal{L})$ constraints, where
$\mathcal{L}$ is the total number of FFs in the circuit. Therefore, the total number of constraints
could be as large as $O(\mathcal{N} \cdot q + \mathcal{L} \cdot (\mathcal{F} + \mathcal{L}))$. Assume that the average number of fan-ins
to a gate is 2.5 and $q = 5$. Then $\mathcal{F} = 2.5\mathcal{N}$, and $\mathcal{L} \cdot \mathcal{F}$ is the dominant term in the
expression above. For real circuits, $\mathcal{L}$ is large, and hence the number of constraints could
be tremendous. In this section, we propose a symbolic propagation method to prune
the number of constraints by a judicious choice of the intermediate variables $m$ and $p$,
without sacrificing accuracy. Basically, for any PI, we introduce $m$ and $p$ variables only for
those gates that are in that PI’s fan-out cone. Also, we collapse constraints on chains of
gates wherever possible (line 6 in Figure 3.5).
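To get a feel for the magnitude, the count can be tallied directly. The sketch below assumes a literal reading of the unpruned formulation: $q$ delay constraints per gate, and, for every FF, one long-path and one short-path constraint per fan-in connection plus one setup and one hold constraint per FF. The function name and the example figures are hypothetical.

```python
def count_lp_constraints(num_gates: int, num_ffs: int, total_fanins: int, q: int) -> int:
    """Unpruned constraint count, ignoring the gate size bounds."""
    delay_constraints = num_gates * q
    # Per source FF: an m- and a p-constraint for every fan-in connection,
    # plus a setup and a hold constraint towards every FF.
    per_ff = 2 * total_fanins + 2 * num_ffs
    return delay_constraints + num_ffs * per_ff


# Illustrative circuit: 1000 gates, 100 FFs, average fan-in 2.5, q = 5.
total = count_lp_constraints(num_gates=1000, num_ffs=100, total_fanins=2500, q=5)
```

For these figures the tally is 525,000 constraints, of which 500,000 come from the per-FF fan-in term. This is the term that the pruning of this section targets.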

The synchronous sequential circuit is first levelized. For this purpose, the inputs
of FFs are considered as pseudo-POs and the outputs of FFs are considered as pseudo-PIs.
Two string variables, $mstring(i)$ and $pstring(i)$, are used to store the long-path delay
and short-path delay constraints associated with gate $i$, respectively. For each gate and
each FF, an integer variable $\omega_i \in \{0, 1\}$ is introduced to indicate its status; that is,
the variable $\omega_i$ has the value 1 whenever $mstring(i)$ and $pstring(i)$ are nonempty, i.e.,
when the constraints stored in $mstring(i)$ and $pstring(i)$ must be propagated; otherwise,
$\omega_i = 0$.

ALGORITHM Symbolic_propagation()
1.  for i = 1 to ℒ {
2.      ω_j ← 0, mstring(j) ← pstring(j) ← "" for all gates and PI’s;
3.      for j = 1 to max_level {
4.          for each gate k at level j {
5.              if ( ω_l = 0 for all l ∈ fanin(k) ) ; /* do nothing */
6.              else if ( among all l ∈ fanin(k), exactly one ω_l = 1, others = 0 ) {
7.                  mstring(k) ← mstring(l') + "d_k",
                    pstring(k) ← pstring(l') + "d_k", ω_k ← 1;
                    /* ω_l' = 1, l' ∈ fanin(k) */
                }
8.              else {
9.                  ω_k ← 1, mstring(k) ← "m_k^i", pstring(k) ← "p_k^i";
10.                 for all ω_l = 1, l ∈ fanin(k) {
11.                     write down the two constraints,
12.                     mstring(l) + d_k ≤ m_k^i,  pstring(l) + d_k ≥ p_k^i;
                    }
                }
            }
        }
    }

Figure 3.5 The symbolic constraints propagation algorithm.

The algorithm for propagating delay constraints symbolically is given in Figure 3.5. In
the following discussion of the algorithm, we elaborate on the formation of mstring; the
formation of pstring proceeds analogously. At line 2, for each gate j, $w_j$ and mstring(j)
are initialized by setting $w_j = 0$, and mstring(j) to the null string. At line 5, we check if
$w_l = 0$ for all $l \in fanin(k)$, i.e., if all of gate k's input gates have a null mstring. If so,
no constraints have to be propagated, and no operations are needed. Next, at line 6, we
check whether exactly one of gate k's input gates, e.g., gate l', has a nonempty mstring
while the others have null mstrings. If so, we may continue to propagate the constraint.
This is implemented by concatenating mstring(l') and "$d_k$" and storing the resulting
string in mstring(k). Also, $w_k$ is set to 1 to indicate that further propagation is required
at this gate. Finally, if more than one of gate k's input gates have a nonempty mstring,
we add a new intermediate variable, $m_k$, and the string "$m_k$" is stored at mstring(k)
(line 9). For each input gate whose mstring is nonempty ($w_l = 1$), we need a delay
constraint (line 12).

Example 3.3 Figure 3.6 gives an example that illustrates the symbolic delay constraints
propagation algorithm. Assume that mstring(11) = "$m_{11}$", mstring(12) = mstring(13) = ""
(null string). Therefore, from lines 6 and 7 of the pseudo-code, mstring(14) =
"$m_{11} + d_{14}$" and $w_{14} = 1$. Propagating this further, we find that similarly, mstring(15) =
"$m_{11} + d_{14} + d_{15}$" and $w_{15} = 1$. Finally, for gate 16, we apply lines 9 through 12 and
find that we must introduce a variable $m_{16}$ and set $w_{16} = 1$. We also write down the
two constraints shown in the figure and add these to the set of LP constraints. □
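
The propagation of mstring described above can be sketched in a few lines of Python. This
is only an illustrative sketch, not the GALANT-S implementation: the netlist encoding, the
function name `propagate_mstrings`, the `seeds` argument (nodes whose status is already 1
when the sweep starts), and the small netlist itself are assumptions made here; the netlist
is loosely modeled on Example 3.3 and does not reproduce the connectivity of Figure 3.6.
Only the long-path strings are handled; pstring would be treated analogously with the
inequality reversed.

```python
def propagate_mstrings(fanin, order, seeds):
    """Symbolically propagate long-path delay strings (mstring only).

    fanin: gate -> list of input nodes (hypothetical netlist encoding)
    order: gates in levelized order
    seeds: node -> list of symbols, for nodes whose status w is already 1
    """
    w = {node: 1 for node in seeds}
    mstring = {node: list(symbols) for node, symbols in seeds.items()}
    constraints = []
    for k in order:
        # Inputs whose strings are nonempty (w = 1); unseen nodes count as w = 0.
        active = [l for l in fanin[k] if w.get(l, 0) == 1]
        if not active:
            # Line 5: nothing to propagate through gate k.
            w[k], mstring[k] = 0, []
        elif len(active) == 1:
            # Lines 6-7: extend the single incoming chain by d_k.
            w[k] = 1
            mstring[k] = mstring[active[0]] + [f"d{k}"]
        else:
            # Lines 8-12: introduce m_k and emit one constraint per active input.
            for l in active:
                lhs = " + ".join(mstring[l] + [f"d{k}"])
                constraints.append(f"{lhs} <= m{k}")
            w[k], mstring[k] = 1, [f"m{k}"]
    return mstring, constraints


# Hypothetical netlist: node 11 carries "m11"; nodes 12 and 13 have null strings.
fanin = {14: [11, 12], 15: [13, 14], 16: [14, 15]}
mstring, constraints = propagate_mstrings(fanin, [14, 15, 16], {11: ["m11"]})
print(" + ".join(mstring[15]))
for c in constraints:
    print(c)
```

In this sketch the chain through gates 14 and 15 produces no constraints at all; constraints
are written only at gate 16, where two nonempty strings meet.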

Although the actual reduction depends on the structure of the circuit, experimental
results show that the symbolic constraints propagation algorithm reduces the number of
constraints to less than 7% of the original number, on average, for the tested circuits.

Figure 3.6 An example illustrating the symbolic delay propagation algorithm.


3.5  Satisfying  Short-Path  Delay  Constraints 

The solution of the LP would, in general, provide a gate size, $w_k$, that does not belong
to the permissible set, $S_k$. If so, we consider the two permissible gate
sizes that are closest to $w_k$; we denote the nearest larger (smaller) size by $w_{k+}$ ($w_{k-}$). As
in Section 2.4, we formulate the following smaller problem:

For all $k = 1, \ldots, \mathcal{N}$: Select $w_k = w_{k+}$ or $w_{k-}$, such that

for all FFs $1 \le i, j \le \mathcal{L}$

    $s_i + Maxdelay(i,j) + T_{setup} \le s_j + P$

    $s_i + Mindelay(i,j) \ge s_j + T_{hold}$
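
As a small illustration of the first step, the two candidate sizes $w_{k-}$ and $w_{k+}$ for a
gate can be looked up with a binary search over the sorted permissible sizes. The function
name `neighbor_sizes` and the five example sizes are hypothetical and are not taken from
the cell library used in the experiments.

```python
from bisect import bisect_left


def neighbor_sizes(w_lp, sizes):
    """Return (w_minus, w_plus), the nearest permissible sizes around w_lp.

    sizes must be sorted in ascending order. If w_lp is itself permissible,
    both values equal w_lp; outside the range, both equal the nearest end.
    """
    i = bisect_left(sizes, w_lp)
    if i < len(sizes) and sizes[i] == w_lp:
        return w_lp, w_lp
    w_minus = sizes[max(i - 1, 0)]
    w_plus = sizes[min(i, len(sizes) - 1)]
    return w_minus, w_plus


# Hypothetical library with five sizes per cell.
sizes = [1, 2, 4, 8, 16]
print(neighbor_sizes(3.1, sizes))
```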

The mapping algorithm described in Section 2.4 can be used to obtain a solution for this
problem.

After the mapping phase, if some of the delay constraints cannot be satisfied, we
have to fine-tune some gate sizes in the circuit. In Section 2.5, we discussed the
approach to resolving the violation of long-path delay constraints. The same strategy
can be applied to synchronous sequential circuit optimization, except that the definition
of path slack must be modified.

For each PO j (including pseudo-POs at the inputs of FFs), the required maximum
(minimum) signal arrival times, $req_l(j)$ ($req_s(j)$), can be expressed as

    $req_l(j) = s_j + P - T_{setup}$
    $req_s(j) = s_j + T_{hold}$                                   (3.9)

The path slack then can be defined as

    $Pslack(P_l(n)) = req_l(n) - m_n$                            (3.10)


Violations of short-path delay constraints, on the other hand, can be resolved by
inserting delay buffers. However, buffer insertion cannot be carried out arbitrarily, since
one must simultaneously ensure that the changes in the circuit do not violate any
long-path constraints.

Note that if gate i is at a PO, it could still fan out to other gates in the circuit; this
is reflected in the definition of the gate slack. Physically, gate slack corresponds to the
amount by which the delay of gate i can be increased before its effect will be propagated
to any POs or FFs, in terms of long-path delay. Therefore, it tells us the maximum delay
that a delay buffer can have if we are to insert a delay buffer at the output of gate i.

For example, consider a part of a circuit as shown in Figure 3.7. In the circuit, gates
4 and 5 are connected to flip-flops. Gate 4 has gates 1 and 2 as its fan-in gates, while
gate 5 has gates 2 and 3 as its fan-in gates. The required signal arrival times at the inputs
of both flip-flops are indicated in the figure. The long-path and short-path signal arrival
times of gates 1, 2, and 3 are shown in the figure. The delays of gates 4 and 5 are 0.5
and 0.4, respectively. With this information, the long-path and short-path signal arrival
times of gates 4 and 5 can be calculated, and are given in the figure. As we can see, the
short-path signal arrival time of gate 4 is 2.1, which is less than the required minimum
signal arrival time, 2.3. Therefore, it is necessary to process the short paths to gate 4.
Gate slacks are calculated using Eq. (3.11). For the time being, we assume that inserting
a delay buffer at the output of a gate will not affect the delay of that gate. Since the
gate slack of gate 2 is 0.2, we can insert a delay buffer with 0.2 delay immediately at
the output of gate 2.¹ After the buffer insertion, it can be seen that $p_4 = 2.3$, $m_4 = 6.0$,
$p_5 = 2.2$, and $m_5 = 5.5$. This way, the required minimum signal arrival time at the input
of flip-flop A is satisfied, while none of the required maximum signal arrival times are
violated. On the other hand, if we insert a delay buffer with delay time 0.3 at the same
location, it can be shown that $p_4 = 2.4$, $m_4 = 6.0$, $p_5 = 2.3$, and $m_5 = 5.6$. Therefore,
although the required minimum signal arrival time at the input of flip-flop A is satisfied,
the required maximum signal arrival time at the input of flip-flop B is violated due to
insertion of the buffer.

¹ Alternatively, we can insert a delay buffer immediately at the input of gate 4.

Figure 3.7 An example illustrating the definition of Gslack.

If output gate $G_{n1}$ violates the hold time constraint, its shortest path $P_s(n1)$ to some
PI is first identified. If $p_{n1}$ is the worst-case shortest-path signal arrival time of gate
n1, and $req_s(n1)$ is the required shortest-path delay, then the delay of $P_s(n1)$ must be
increased by at least $req_s(n1) - p_{n1}$.

In  a  real  situation,  however,  inserting  a  buffer  at  the  output  of  a  gate  will  affect  the 
delay  of  that  gate.  Therefore,  care  must  be  taken  when  performing  buffer  insertion  to 
increase  short-path  delay. 

The algorithm for inserting buffers is shown in Figure 3.8. At the beginning of this
phase, we first back-propagate gate slacks from POs and all FFs. The gate slack of
each gate is determined recursively using (3.11). In line (4) of the algorithm, beginning
from the smallest buffer in the library, we try to insert a buffer at the output of gate
$G_{ni}$. The delay of the buffer is denoted by delay(bf). Since the output capacitance of
$G_{ni}$ is changed during this process, we have to recalculate its delay, which is denoted by
delay'($G_{ni}$).

ALGORITHM Insert_buffer(n1)

1.  Let P_s(n1) be the shortest path to gate n1, and G_n1, ..., G_nk be on path
    P_s(n1) (G_ni fans out to G_n(i-1), 2 <= i <= k, k = # of gates along P_s(n1).);
2.  i <- 1;
3.  while ( p_n1 < req_s(n1) ) {
4.      if ( there exists a (smallest) buffer, bf, in the library such that:
             delay(G_ni) < delay'(G_ni) + delay(bf) <= delay(G_ni) + slack(G_ni) ) {
5.          insert bf at the output of G_ni;
6.          incrementally update slack(j), m_j, p_j for each gate j in the circuit;
7.          if ( p_n1 >= req_s(n1) ) stop;
8.          else goto 1.
        }
9.      i <- i + 1;
    }

Figure 3.8 The buffer insertion algorithm.

Example 3.4 In Figure 3.9, let gate 4 be connected to some FF. The required maximum
arrival time ($req_l$) is 4.8, and the required minimum arrival time ($req_s$) is 1.3. The actual
long-path delays ($m_i$) and short-path delays ($p_i$) for all gates are as indicated. The
gate slack of each gate is calculated and shown in the figure. Since gate 4 violates the
shortest-path delay requirement, the shortest path to it, $P_s(4)$, is found; this can be seen
to include gate 3. Since the gate slack of gate 3 is 1.0, we can insert a delay buffer between
gates 3 and 4. If delay(3) = 0.5, the delay after introducing the buffer, delay'(3) = 0.4,
and delay(bf) = 0.3, then the new value of $p_4$ is 1.4, which satisfies $req_s(4)$. □

Figure 3.9 An example illustrating the buffer insertion algorithm.
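
The admissibility test in line 4 of Figure 3.8 can be checked numerically with a short
sketch. The function name `pick_buffer`, the tolerance `eps`, and the list of buffer delays
are assumptions made for illustration. The sketch also simplifies the model by using one
recalculated gate delay for every candidate buffer, whereas in practice delay' depends on
the input capacitance of the buffer being tried.

```python
def pick_buffer(delay, delay_loaded, slack, buffer_delays, eps=1e-9):
    """Return the smallest buffer delay that passes the test in line 4, or None.

    delay:         delay(G) before insertion
    delay_loaded:  delay'(G), the gate delay recalculated with the buffer attached
    slack:         gate slack of G
    buffer_delays: delays of the buffers in a (hypothetical) library
    """
    for bf in sorted(buffer_delays):
        total = delay_loaded + bf
        # The path must get strictly slower, but by no more than the gate slack.
        if delay + eps < total <= delay + slack + eps:
            return bf
    return None


# Numbers from Example 3.4; the library contents are made up.
bf = pick_buffer(delay=0.5, delay_loaded=0.4, slack=1.0, buffer_delays=[0.1, 0.3, 0.6])
increase = 0.4 + bf - 0.5  # change in the short-path arrival time at gate 4
print(bf, round(increase, 3))
```

With the values of Example 3.4, a buffer of delay 0.1 is rejected because the recalculated
gate delay cancels it out, while the buffer of delay 0.3 lengthens the path by 0.2.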

3.6  Experimental  Results 

The algorithms above were implemented in the program GALANT-S on a Sun SPARCstation 10.
The test circuits include many of the ISCAS85 combinational benchmark circuits
[29] and ISCAS89 synchronous sequential circuits [39]. Each cell in the standard-cell
library is available in five different sizes with different driving capabilities.

First, in Table 3.1, the experimental results using the symbolic constraints
propagation algorithm are listed. For each circuit, the numbers of primary inputs, primary
outputs, flip-flops, and gates are also shown. Both the number of longest-path delay
constraints without using the symbolic constraint propagation algorithm and the number of
constraints remaining after pruning by the algorithm are given. It is clear that our pruning
algorithm is very efficient. The number of delay constraints is reduced by more than 93% on
the average.

For a given desired clock period ($P_{spec}$), the optimized results both with and without
clock skew optimization are shown in Table 3.2. Depending on the structure of the
circuits, the improvement over total area of the circuit ranges from 1.2% to almost 20%.
As for the execution time, the run time ranges from about the same for some circuits, to
less than double or triple for most circuits.

One may raise the question of whether it is worthwhile to minimize circuit area
through clock skew optimization, since the reduction of area is not very significant for
some circuits. However, Table 3.3 provides some more in-depth experiments on two
circuits, s838 and s1423. In this experiment, we try to minimize the area using different
specified clock periods. As one can see, for s1423, the minimum clock period without clock
skew optimization is about 32.5. On the other hand, using clock skew optimization, the
minimum period can be as small as 22, which gives an almost 33% improvement in terms
of clock speed. For s838, using clock skew optimization also gives a 30% improvement.
Hence, using clock skew optimization can not only reduce the circuit area, but also allow
a faster clock speed.

Table 3.1 Experimental results of the symbolic constraints propagation algorithm for
ISCAS89 benchmark circuits.

Circuit    # of  # of  # of    # of       longest-path constraints
            PIs   POs   FFs   gates     original       pruned       %
s27           4     1     3      10          133           27   20.3%
s208         11     2     8     104         3276          214    6.5%
s298          3     6    14     137         4556          280    6.1%
s344          9    11    15     160         6720          401    6.0%
s349          9    11    15     160         6816          417    6.1%
s382          3     6    21     158         7488          575    7.7%
s386          7     7     6     171         4758          282    5.9%
s400          3     6    21     162         7824          656    8.4%
s420         19     2    16     196        11830          544    4.6%
s444          3     6    21     181         8592          830    9.7%
s510         19     7     6     211        10775          553    5.1%
s526          3     6    21     229        11688          541    4.6%
s641         35    24    19     379        30402         1331    4.4%
s838         35     2    32     446        55948         2670    4.8%
s953         16    23    29     395        34470         1788    5.2%
s1196        14    14    18     529        32736         2241    6.8%
s1423        17     5    74     657       106379         7953    7.5%
s1488         8    19     6     748  (illegible)         1506    7.2%
s1494         8    19     6     725        20860  (illegible)    7.3%
s5378        35    49   179    2779       911854  (illegible)    0.7%


Table 3.2 Performance comparison with and without clock skew optimization for
ISCAS89 benchmark circuits.

                    with clock skew opt.      w/o clock skew opt.
Circuit  P_spec     Area (A1)   Run time     Area (A2)   Run time    A1/A2
s27        3.75        151.12      0.32s        179.29      0.30s    0.842
s208        6.8   (illegible)      3.32s       1745.25      3.06s    0.805
s298        6.5   (illegible)      4.20s       2295.58      4.12s    0.926
s344        8.0       2093.00      7.10s       2400.67      6.91s    0.872
s349        8.0       2128.75      6.18s       2498.17      6.01s    0.852
s382        8.5       2216.50      7.68s       2334.04      6.04s    0.949
s386        6.5       3521.37      7.55s       3577.17      6.14s    0.984
s400        8.4       2314.00      8.19s       2515.50      7.13s    0.920
s420       12.0   (illegible)      9.06s       2952.63      8.94s    0.854
s444        8.5       2463.50     11.55s       2724.04      7.22s    0.904
s510       11.0       3219.67     16.13s       3261.37     10.35s    0.987
s526        6.5       3914.08     10.21s       4311.67      9.35s    0.908
s641       22.0       4598.75     51.59s       4747.17     26.49s    0.969
s838       10.5       6162.00    100.67s       7324.42     43.77s    0.841
s953       10.5       5516.87    243.93s       5898.75     67.69s    0.935
s1196      12.0       8550.21    288.15s       8752.42     97.43s    0.977
s1423      35.0       9871.87   1069.75s      10151.38     80.71s    0.972
s1488      10.0      15025.29    148.27s      15322.12    137.61s    0.981
s1494      10.0      14773.96    158.14s      14962.46    115.45s    0.987
s5378      10.0      29219.12   2633.78s      29717.53   1414.49s    0.983


Table 3.3 Improving possible clocking speeds using clock skew optimization.

                         with clock skew opt.      w/o clock skew opt.
Circuit       P_spec     Area (A1)   Run time     Area (A2)   Run time    A1/A2
s838            10.5       6162.00    100.67s       7324.42     43.77s    0.841
               10.25       6165.25    102.18s       7365.58     45.30s    0.837
                10.0       6182.04    103.25s             -          -        -
         (illegible)       6637.58    130.20s             -          -        -
                6.75       7417.58    172.31s             -          -        -
                 6.5             -          -             -          -        -
s1423           35.0       9871.87   1069.75s      10151.38     80.71s    0.972
                32.5       9998.63   1130.89s      10545.71     84.05s    0.948
                30.0      10154.08   1450.03s             -          -        -
                22.0      12178.83   1605.43s             -          -        -
                20.0             -          -             -          -        -


3.7 Comment and Conclusions

3.7.1 Clock tree routing

In [34], Tsay proposed a zero-skew clock tree routing algorithm. In his approach, a
clock tree is modeled as an RC tree for delay analysis. Based on a lumped delay model
and the delay computation method, he found that any two zero-skewed subtrees can be
merged into a tree with zero skew by tapping the connection to a specific location of each
subtree. The approach is a recursive bottom-up algorithm. To realize the clock routing
of a nonzero-skew system as in our approach, Tsay's zero-skew routing algorithm can
be modified to handle the problem [34]. This can be done by adding a fictitious delay
element on each clock pin.

Let us assume that the optimal clock delay to latch i is $D_0 + D_i$, where $D_0$ is a
common offset value which is unknown until the clock routing is determined. Thus, the
skew between latch i and latch j is $D_i - D_j$. Let $D_{max}$ be the maximum clock delay, i.e.,

    $D_{max} = \max_k (D_0 + D_k) = D_0 + \max_k D_k$            (3.12)

Define the fictitious delay of latch i as

    $d_i = D_{max} - (D_0 + D_i) = \max_k D_k - D_i$             (3.13)

In  other  words,  each  clock  pin  attached  to  a  latch  is  modeled  as  a  lumped  delay  model 
with  an  input  loading  capacitance  and  a  branch  delay,  as  shown  in  Figure  3.10.  Then  the 
zero-skew  routing  algorithm  is  performed  on  this  modified  clock  tree  with  the  fictitious 
delay  on  each  clock  pin. 
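
A minimal numerical sketch of Eq. (3.13) follows. The latch names and skew values are
made up for illustration, and `fictitious_delays` is a hypothetical helper name. The
common offset $D_0$ cancels out, so only the relative delays $D_i$ are needed.

```python
def fictitious_delays(D):
    """Map each latch to d_i = max_k D_k - D_i (Eq. 3.13).

    D: latch name -> relative optimal clock delay D_i (the offset D_0 is omitted).
    """
    d_max = max(D.values())
    return {latch: d_max - D_i for latch, D_i in D.items()}


# Hypothetical optimal skews for three latches.
D = {"a": 0.0, "b": 1.5, "c": -0.5}
d = fictitious_delays(D)
print(d)

# Every pin now sees the same total D_i + d_i, so a zero-skew router applied to
# the modified pins realizes the desired nonzero skews.
totals = {latch: D[latch] + d[latch] for latch in D}
print(totals)
```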

Figure 3.10 (a) A clock pin on a latch. (b) The modified model of a clock pin according
to the optimal skew obtained from our algorithm.

3.7.2 Conclusions

In this chapter, a unified approach to minimizing synchronous sequential circuit area
and optimizing clock skews has been presented. Traditionally, the circuit area of a
synchronous sequential circuit is minimized one combinational subcircuit at a time. Our
experiments have shown that this may lead to very suboptimal solutions in some cases.
Experimental results show that using clock skew optimization can not only reduce the
circuit area, but also allow a faster clock speed.

In our formulation, for each gate in the circuit, we use the same delay variable ($d_k$)
when calculating longest-path delay and shortest-path delay. In practice, however, the
worst-case maximum delay and worst-case minimum delay are different for a specific gate.
Nonetheless, our formulation and algorithm described in this chapter can be modified to
consider this effect.

In the experimental results, only active circuit area (cell area) is considered. The data
do not include the clock tree routing area. It is possible that due to the introduction
of clock skew at each latch, the clock tree routing area may be increased. On the other
hand, since both positive and negative clock skews are allowed at each latch, it is possible
that the net increase in the clock tree routing area may be insignificant. Nevertheless,
a more thorough study should be conducted before the clock skew optimization technique
can be applied to real circuit designs.

Finally, the clock skew scheme may appear similar to the maximum-rate pipelining
technique used in pipelined computer systems [40]. However, the clock in a maximum-rate
pipeline cannot be single-stepped or even slowed down significantly. This makes
maximum-rate designs extremely hard to debug. In the clock skew scheme, by contrast,
single-stepping is always possible [9]. Therefore, circuits implemented using clock skew
techniques can be debugged without difficulties.


CHAPTER 4


PARTITIONING FOR OPTIMIZATION

4.1  Introduction 

As  indicated  in  Section  3.4,  the  number  of  constraints  in  our  formulation  of  the  LP 
is,  in  the  worst  case,  proportional  to  the  product  of  the  number  of  gates  and  the  number 
of  FFs  in  the  circuit.  Ideally,  for  a  given  synchronous  sequential  circuit,  all  variables 
and  constraints  should  be  considered  together  to  obtain  an  optimal  solution.  However, 
for  large  synchronous  sequential  circuits,  the  size  of  the  LP  could  be  prohibitively  large 
even  with  our  symbolic  constraint  propagation  algorithm.  Therefore,  it  is  desirable  to 
partition  large  synchronous  sequential  circuits  into  smaller,  more  tractable  subcircuits, 
so  that  we  can  apply  the  algorithm  described  in  Chapter  3  to  each  subcircuit.  While 
this  would  entail  some  loss  of  optimality,  an  efficient  partitioning  scheme  would  minimize 
that  loss;  moreover,  the  reduction  in  execution  time  would  be  very  rewarding. 

It is well-known that multiple-way network partitioning problems are NP-hard [41].
Therefore, typical approaches to solving such problems find heuristics that will yield
approximate solutions in polynomial time [42, 43]. Traditional partitioning problems
usually have explicit objective functions; for example, in physical layout it is desirable
to have minimal interface signals resulting from partitioning the circuit, and hence the
objective function to be minimized there is the number of nets connecting more than
two blocks. Our synchronous sequential circuit partitioning problem, however, is made
harder by the absence of a well-defined objective function; since our ultimate goal is to
minimize the total area of the circuit, there is no direct physical measure that could serve
as an objective function for partitioning. In this chapter, we develop a heuristic measure
that will be shown to be an effective objective function for our partitioning problem.

In this chapter, we first briefly discuss previous work on network partitioning in
Section 4.2. We develop our partitioning algorithm based on Sanchis' multiple-way
partitioning algorithm [42]; details of Sanchis' algorithm are provided in Section 4.3. We
present our synchronous sequential circuit partitioning algorithm in Section 4.4. Finally,
experimental results are given in Section 4.5, and we conclude this chapter in the same
section.

4.2  Previous  Work  on  Partitioning 

As VLSI system complexity increases, a divide-and-conquer approach is used to keep
the circuit design process tractable. Using this strategy, a complex problem is divided into
small subproblems, thus reducing the complexity of the original problem dramatically.
Given a circuit (network) consisting of a set of modules (nodes) connected by a set
of signals (nets), the objective of a K-way partitioning is to divide the whole circuit into
K subsets such that the number of signals crossing these subsets is minimized.

A network as described above can be modeled as a graph where each edge (net)
connects exactly two vertices (nodes). The graph partitioning problem can be formally
stated as follows. We are given an undirected graph, $G = (V, E)$, with vertices
$V = \{v_1, v_2, \ldots, v_n\}$, in which the weight of each edge $e = (v_i, v_j)$ represents the
cost of putting $v_i$ and $v_j$ in separate partitions. The problem is to divide the vertices
into k disjoint sets $\{P_1, P_2, \ldots, P_k\}$ for a given k, such that some cost function is
optimized. The cost function can be based on the weights of the edges cut and/or the
sizes of the partitions.
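
As a concrete instance of such a cost function, the sketch below sums the weights of the
edges whose endpoints fall in different blocks. The edge list, the block assignment, and
the name `cut_cost` are made up for illustration; balance terms based on partition sizes
are deliberately left out.

```python
def cut_cost(edges, block_of):
    """Total weight of edges whose two endpoints lie in different blocks.

    edges:    iterable of (u, v, weight) tuples
    block_of: vertex -> index of the block that contains it
    """
    return sum(w for u, v, w in edges if block_of[u] != block_of[v])


# Hypothetical 4-vertex graph split into two blocks.
edges = [("a", "b", 3), ("b", "c", 1), ("c", "d", 2), ("a", "d", 4)]
block_of = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cut_cost(edges, block_of))
```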

Ford and Fulkerson [44] proposed the max-flow-min-cut algorithm, which finds the
optimum solution between subsets of unconstrained sizes in polynomial time. Using
their algorithm, a minimum cut separating designated nodes s and t can be found by
flow techniques in $O(n^3)$ time, where $n = |V|$ = number of vertices in the graph. Cut-tree
techniques [45] will yield the global minimum cut using $n - 1$ minimum cut computations
in $O(n^4)$ time. However, these algorithms tend to generate very unbalanced partitions.
Unfortunately, when size balancing constraints are imposed, the problem becomes
NP-complete. Because of its importance, many heuristics have been proposed to solve the
partitioning problem [42,43,46-51]. These heuristics can be classified into the following
two categories:

(1) Iterative method - Iterative heuristics explore the solution space by making a large
number of moves (small changes to the solution) either randomly or greedily in an
attempt to discover a global minimum.

(2) Spectral method - In spectral partitioning techniques, the eigenvectors and
eigenvalues (spectrum) of a graph are computed, and a cost function is shown to be
minimized by a function of the spectrum. Some heuristic is used for mapping the
information provided by the eigenvectors into an actual partition.

The  two  approaches  are  discussed  in  more  detail  in  the  following  two  subsections. 

4.2.1  Iterative  partitioning 

In [46], Kernighan and Lin described a heuristic procedure for graph partitioning
which became the basis for most of the iterative improvement partitioning algorithms
generally used. Their algorithm deals with the problem of partitioning a network with
n cells (vertices) (n even) into two partitions of n/2 cells each. The basic approach is to
start with a given partition and to improve it by iteratively choosing one node from each
of the blocks (partitions) and exchanging them. The nodes are selected to be switched so
that a maximum decrease in cut-set size is obtained (or minimum increase if no decrease
is possible). The algorithm consists of a series of passes. In each pass, two nodes are
interchanged in turn until all n nodes have been moved. In each iteration, the two nodes
to be moved are chosen from among the ones which have not been moved during the
pass. At the end of each pass, the n/2 partitions produced during the pass are examined
and the one with the minimum cut-set size is chosen as the starting partition for the next
pass. Passes are performed until no improvement in cut-set size can be obtained.

Fiduccia and Mattheyses [47] modified the Kernighan-Lin algorithm. Their algorithm
has a linear worst-case complexity in each pass. In their algorithm, only one cell is moved
between two partitions at a time instead of switching pairs. This allows for more flexibility
in block sizes. In addition, a method is introduced for keeping the candidates in each
partition sorted at all times. Elegant data structures were developed through which they
could maintain the sorted candidates, thus avoiding a search for the candidate to
be moved. Hence a linear-time complexity is achieved. Fiduccia and Mattheyses also
introduced the idea of preserving balance in the sizes of the blocks. Since only one cell
is moved at a time, block sizes cannot be constrained to be constant during the pass.
Instead, each block's size is constrained to be within a given range. When choosing the
next cell to be moved, the cell with the highest gain (the largest reduction in the number
of cuts across the partition) in each block is examined. It will always be possible to move at
least one of these cells while preserving balance. If both may be moved, the one with the
highest gain is selected.
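
The notion of gain used when selecting a cell can be illustrated with a naive sketch for a
two-way partition. This is not the Fiduccia-Mattheyses data structure itself: it recomputes
the gain of one cell by scanning all nets rather than maintaining sorted candidates, and
the nets, the cell names, and the function name `fm_gain` are hypothetical.

```python
def fm_gain(cell, nets, side):
    """Reduction in the number of cut nets if `cell` moves to the other block.

    nets: list of sets of cell names (each set is one net)
    side: cell name -> 0 or 1, the block currently holding the cell
    """
    gain = 0
    for net in nets:
        if cell not in net:
            continue
        same = sum(1 for c in net if side[c] == side[cell])
        other = len(net) - same
        if same == 1 and other > 0:
            gain += 1  # cell is alone on its side: moving it uncuts the net
        elif other == 0 and same > 1:
            gain -= 1  # net lies entirely on this side: moving the cell cuts it
    return gain


# Hypothetical example with three nets and four cells.
nets = [{"a", "b"}, {"a", "c"}, {"b", "c", "d"}]
side = {"a": 0, "b": 1, "c": 0, "d": 1}
gains = {c: fm_gain(c, nets, side) for c in side}
print(gains)
```

In this example cell b has the highest gain, so it would be the first candidate examined,
subject to the block-size range described above.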

Krishnamurthy [48] further improved the Fiduccia-Mattheyses algorithm by refining
the method for choosing the best cell to be moved. He introduced the concept of level
gain. Consider the example shown in Figure 4.1. Moving A would eliminate one net;
moving B, however, would not eliminate any net. However, if we move B and C together,
two nets can be eliminated. Therefore, the first level gain ($\gamma_1$) of A is 1, and that of B
is 0; while the second level gain ($\gamma_2$) of A is 0, and that of B is 2.

Figure 4.1 An example illustrating the concept of level gain.

The gain vector of a cell E is then defined as

    $\Gamma(E) = \langle \gamma_1(E), \ldots, \gamma_l(E) \rangle$              (4.1)

where $l$ is the number of levels used. These vectors are ordered lexicographically. At each
iteration, the free cell with the largest gain vector is moved. Computing higher-level gains
enables the algorithm to better distinguish between cells whose first-level gains are the
same.
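
The lexicographic ordering of gain vectors maps directly onto tuple comparison in Python,
as the following sketch shows. The vectors for A and B are the two-level gains quoted for
Figure 4.1; cell C and its vector, as well as the helper name `select_cell`, are
hypothetical additions used only to show tie-breaking on the second level.

```python
def select_cell(gain_vectors):
    """Return the free cell whose gain vector is lexicographically largest.

    gain_vectors: cell name -> tuple (gamma_1, ..., gamma_l)
    """
    # Python compares tuples element by element, i.e. lexicographically.
    return max(gain_vectors, key=gain_vectors.get)


gain_vectors = {"A": (1, 0), "B": (0, 2)}
first_choice = select_cell(gain_vectors)

# A hypothetical cell C ties with A on the first level and wins on the second.
with_tie = dict(gain_vectors, C=(1, 1))
second_choice = select_cell(with_tie)
print(first_choice, second_choice)
```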

In [42], Sanchis further generalized Krishnamurthy's algorithm to deal with the
multiple-way partitioning problem. There are several ways in which a two-way partitioning
algorithm can be adapted to multiple-way partitioning. For example, one can successively
choose pairs of blocks and apply the two-way algorithm to these pairs. However, since
eliminating a net from the cut-set formed across a given pair of blocks does not
necessarily remove it from the cut-set of the multiple block partition, this method may not
be able to obtain good results. The second method consists of a hierarchical use of the
two-way algorithm. For example, for a four-way partition we could use the two-way
algorithm to partition the cells into two blocks, and then partition each of these two blocks
into two blocks each. However, the first partitioning will try to minimize the number
of connections between the first two blocks, thus tending to maximize the connections
inside these two blocks and making it harder to obtain good partitions thereafter. An
alternative for obtaining better solutions is to attempt to improve the partition uniformly
at each step. Under such a scheme, we should consider at each iteration during a pass all
possible moves of each free cell from its home block to any of the other blocks, and the
best of such moves should be chosen. This is the basic approach taken by Sanchis. Since
we develop our partitioning scheme based on Sanchis' algorithm, details of the algorithm
will be discussed in Section 4.3.

More recently, Yeh et al. [43] proposed a general-purpose multiple-way partitioning
algorithm. In their approach, a top-down clustering is carried out first to group highly
connected subcircuits into clusters and then condense these clusters into single nodes prior
to the execution of the iterative procedure. They also proposed a uniform multipin net model
to capture the contributory moves. Consider the example shown in Figure 4.2. Suppose that
during a pass, nodes A, B and C have not been locked. Moving A would eliminate three
nets from the cut-set and introduce one net into it. Thus the gain would be 2. Moving
B alone would not eliminate any net, but would allow three other nets to be eliminated,
provided that node C is moved along with B. If we use the level gain model, the move
of A would be favored, which would suppress the movement of B and C. Based upon the
above observation, a different approach is introduced. Let us now concentrate on the
perspective of nets. If we want to eliminate one of these three nets, we would have to
move B and C together. This would also eliminate the other two at the same time.
Thus the gain would be 3. On the other hand, if we decide to remove one of the nets
eliminated by moving A, we would move A only, and the gain is 2. Therefore, if we view
a move as initiated by a net instead of a node, the ambiguity associated with selecting
moves would be reduced. Based on this model, a primal-dual iteration is used to enhance
the iterative improvement. The primal process is based on the Fiduccia-Mattheyses
algorithm. The dual process is similar to the primal process except that it concentrates
on the net perspective during the pass.

Figure 4.2 An example for multipin net model.

4.2.2  Spectral  partitioning 

Spectral-based partitioning extracts information about the structure of the graph
from the eigenvalues and eigenvectors of the matrices derived from the graph. A graph
can be represented by the adjacency matrix $A(G)$,

$$A_{ij} = \begin{cases} a_{ij}, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (4.2)$$

where $a_{ij}$ is the weight of the edge between $v_i$ and $v_j$. By convention, $A_{ii} = 0$ for all
$i = 1, \ldots, n$. If we let $d(v_i)$ denote the degree of node $v_i$ (i.e., the sum of weights of all
edges incident on $v_i$), we obtain the $n \times n$ diagonal degree matrix $D(G)$ defined by

$$D_{ij} = \begin{cases} d(v_i), & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases} \qquad (4.3)$$

The Laplacian of $G$ is the $n \times n$ symmetric matrix $Q(G) = D(G) - A(G)$. Since the
rows (and columns) sum to 0, the Laplacian is singular; it has rank of at most $n - 1$ and 0
as an eigenvalue. In fact, the multiplicity of the 0 eigenvalue is the number of connected
components of $G$.
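
As a small illustration of these properties, the following Python sketch builds the Laplacian of a hypothetical five-node weighted graph with two connected components and checks that every row sums to zero. It also verifies that the indicator vector of each component is mapped to the zero vector, which is why the multiplicity of the 0 eigenvalue equals the number of components. The graph, the function names, and the edge-list format are assumptions made for the example.

```python
def laplacian(n, weighted_edges):
    """Build Q = D - A for an undirected weighted graph on nodes 0..n-1."""
    q = [[0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        q[i][j] -= w  # off-diagonal entries hold -a_ij
        q[j][i] -= w
        q[i][i] += w  # diagonal entries accumulate the degree d(v_i)
        q[j][j] += w
    return q


def matvec(q, x):
    """Multiply the matrix q by the vector x."""
    return [sum(qij * xj for qij, xj in zip(row, x)) for row in q]


def count_components(n, weighted_edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, j, _ in weighted_edges:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(n)})


# Hypothetical graph: component {0, 1, 2} and component {3, 4}.
edges = [(0, 1, 2), (1, 2, 1), (3, 4, 5)]
Q = laplacian(5, edges)
row_sums = [sum(row) for row in Q]
print(row_sums, count_components(5, edges))
```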

Donath and Hoffman [52] derived a lower bound on the weight of the edges cut ($E_c$)
by a partition satisfying predetermined partition sizes. If $m_1 \ge m_2 \ge \cdots \ge m_k$ are
the given partition sizes and $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$ are the smallest $k$ eigenvalues of the
Laplacian, then $E_c \ge \frac{1}{2} \sum_{i=1}^{k} \lambda_i m_i$.

In [53], Hall showed that the eigenvalues/eigenvectors of the Laplacian solve the
one-dimensional quadratic placement problem of finding the vector $x = (x_1, x_2, \ldots, x_n)$ which
minimizes the total weighted squared distance between $n$ points that can be expressed
as

$$z = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} (x_i - x_j)^2 \qquad (4.4)$$

subject to the constraint $|x| = (x^T x)^{1/2} = 1$.

Here $x_i$ is the coordinate assigned to vertex $v_i$ in a one-dimensional space. The
constraint is imposed to avoid the trivial solution in which all $x_i$s are equal. Equation
(4.4) can be rewritten in matrix notation in quadratic form as

$$\text{minimize } z = x^T Q x \quad \text{subject to } (x^T x)^{1/2} = 1 \qquad (4.5)$$

To solve this constrained minimization problem, we form the Lagrangian

$$L = x^T Q x - \lambda (x^T x - 1) \qquad (4.6)$$

Taking the partial derivative of $L$ with respect to $x$ and setting it equal to 0 yields

$$2Qx - 2\lambda x = 0 \qquad (4.7)$$

which can be rewritten as

$$(Q - \lambda I)x = 0 \qquad (4.8)$$

where $I$ is the identity matrix.

This is an eigenvalue formulation for $\lambda$. Premultiplying (4.8) by $x^T$ and using
$x^T x = 1$ gives $z = x^T Q x = \lambda$, so the objective value at each stationary point equals the
corresponding eigenvalue. For a system of $n$ linear equations, there are $n$
possible eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. For a connected graph, the Laplacian has rank
$n - 1$. The minimum eigenvalue 0 gives the trivial solution $x = (1/\sqrt{n}, \ldots, 1/\sqrt{n})$.
Hence the eigenvector corresponding to the second smallest eigenvalue, $\lambda_2$, is used. The
second smallest eigenvalue is a lower bound on a nontrivial solution to (4.5). In his
paper, Hall heuristically derived a $k$-dimensional generalization in which the eigenvectors
are used as the basis for clustering placement.
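
The role of the second eigenvector can be illustrated with a short Python sketch. It approximates that eigenvector by power iteration on a shifted Laplacian, with the constant vector projected out at every step; this numerical method is an assumption chosen to keep the example dependency-free and is not the solver used in the cited papers. The test graph (two triangles joined by one edge) is hypothetical.

```python
import math


def fiedler_vector(n, edges, iterations=500):
    """Approximate the eigenvector of the second smallest Laplacian eigenvalue.

    Power iteration on (c*I - Q), with the constant vector removed at each
    step, converges to the eigenvector of the smallest nonzero eigenvalue
    of Q when the graph is connected.
    """
    degree = [0.0] * n
    for i, j, w in edges:
        degree[i] += w
        degree[j] += w
    c = 2.0 * max(degree)  # Gershgorin bound: every eigenvalue of Q is <= c
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        # y = (c*I - Q) x, accumulated edge by edge
        y = [(c - degree[i]) * x[i] for i in range(n)]
        for i, j, w in edges:
            y[i] += w * x[j]
            y[j] += w * x[i]
        mean = sum(y) / n  # project out the trivial solution (1, ..., 1)
        y = [v - mean for v in y]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
barbell = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 1.0),
           (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0)]
x = fiedler_vector(6, barbell)
side_a = {i for i in range(6) if x[i] < 0}
side_b = {i for i in range(6) if x[i] >= 0}
# z = x^T Q x evaluated through the edge sum; it approximates lambda_2.
z = sum(w * (x[i] - x[j]) ** 2 for i, j, w in barbell)
print(sorted(side_a), sorted(side_b), z)
```

Splitting the vertices by the sign of their coordinate separates the two triangles, and the value of $z$ at the computed vector matches the second smallest eigenvalue of this graph, $(5 - \sqrt{17})/2$.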

Recently, Hagen and Kahng [49] established a connection between Hall’s formulation
and 2-way ratio-cut [54] partitioning. They construct a 2-way partition from $v_2$ (the
eigenvector corresponding to $\lambda_2$) by sorting the entries of $v_2$ and identifying a cut in the
sorted order which yields the best ratio-cut value.
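
A minimal sketch of this sweep is given below, assuming the usual ratio-cut definition, i.e., the weight of the cut edges divided by the product of the two block sizes. The graph and the coordinate values are hypothetical; in practice the values would be the entries of the second eigenvector.

```python
def best_ratio_cut(edges, values):
    """Sort vertices by `values` and return (ratio, left_set) of the best split."""
    n = len(values)
    order = sorted(range(n), key=lambda v: values[v])
    best = None
    left = set()
    for v in order[:-1]:  # stop early so both sides stay non-empty
        left.add(v)
        # weight of the edges with exactly one endpoint on the left side
        cut = sum(w for i, j, w in edges if (i in left) != (j in left))
        ratio = cut / (len(left) * (n - len(left)))
        if best is None or ratio < best[0]:
            best = (ratio, set(left))
    return best


# Hypothetical graph: two triangles joined by the edge (2, 3).
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 3, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1)]
# Hypothetical eigenvector entries, one per vertex.
values = [-0.46, -0.46, -0.26, 0.26, 0.46, 0.46]
ratio, left = best_ratio_cut(edges, values)
print(ratio, sorted(left))
```

The sweep examines only the $n - 1$ splits of the sorted order, rather than all possible bipartitions.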

In [51], Chan et al. developed a spectral approach to multiple-way ratio-cut
partitioning which provides a generalization of the ratio-cut cost metric to $k$-way partitioning
and a lower bound on this cost metric. Their approach involves finding the $k$ smallest
eigenvalue/eigenvector pairs of the Laplacian. The eigenvectors provide an embedding
of the graph’s $n$ vertices into a $k$-dimensional subspace. A heuristic is then used to
group the points in the embedding into $k$ partitions.


4.3  Sanchis’  Multiple-way  Partitioning  Algorithm

A cell is labeled free if it has not been moved during the pass; otherwise, it is labeled
locked. Define

$$\phi_{A_i}(N) = |\{C \mid C \in A_i \text{ and } C \in C_N \text{ and } C \text{ is free}\}|$$

$$\lambda_{A_i}(N) = |\{C \mid C \in A_i \text{ and } C \in C_N \text{ and } C \text{ is locked}\}| \qquad (4.9)$$

Thus $\phi_{A_i}(N)$ is the number of free cells on the net $N$ which are in the block $A_i$, while
$\lambda_{A_i}(N)$ is the number of locked cells on net $N$ which are in the block $A_i$. For each block
$A_i$ and each net $N$, define the binding numbers

$$\beta_{A_i}(N) = \begin{cases} \phi_{A_i}(N) & \text{if } \lambda_{A_i}(N) = 0 \\ \infty & \text{if } \lambda_{A_i}(N) > 0 \end{cases} \qquad (4.10)$$

The binding number of a net with respect to a block of a partition indicates how tightly
the net is bound to the block. We also define the function $\beta'_{A_i}(N)$ as follows:

$$\beta'_{A_i}(N) = \sum_{j \neq i} \beta_{A_j}(N) \qquad (4.11)$$

That is, $\beta'_{A_i}(N)$ is the sum of all of the binding numbers of net $N$ with respect to all of
the blocks of the partition except block $A_i$; it gives a measure of how tightly $N$ is bound
to the partitions other than $A_i$.

We now define the $i$th level gain associated with moving cell $C$ from block $A_j$ to block
$A_k$.

$$\gamma_i(C) = |\{N \in N_C \mid \beta'_{A_k}(N) = i \text{ and } \beta_{A_k}(N) > 0\}|
 - |\{N \in N_C \mid \beta'_{A_j}(N) = i - 1 \text{ and } \beta_{A_j}(N) > 0\}| \qquad (4.12)$$

The first term in (4.12) measures the $i$th level benefit of moving cell $C$ from the side of
the partition consisting of all blocks except $A_k$, to $A_k$. The second term measures the
$i$th level penalty of moving $C$ from $A_j$ to the side of the partition consisting of all blocks
except $A_j$.
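
The binding numbers and the level gain (4.12) can be sketched directly from their definitions. The data layout below (nets as sets of cell names, a dictionary mapping each cell to its block) and the small three-block example are assumptions made for illustration; an efficient implementation would update these quantities incrementally instead of recomputing them.

```python
import math


def binding_number(net, block_of, locked, block):
    """Free-cell count of `net` in `block`, or infinity if a cell there is locked."""
    cells = [c for c in net if block_of[c] == block]
    if any(c in locked for c in cells):
        return math.inf
    return len(cells)


def level_gain(cell, src, dst, level, nets, block_of, locked, blocks):
    """Level-`level` gain of moving `cell` from block `src` to block `dst`."""

    def beta(net, b):
        return binding_number(net, block_of, locked, b)

    def beta_prime(net, b):
        # sum of the binding numbers over all blocks except b
        return sum(beta(net, other) for other in blocks if other != b)

    nets_on_cell = [net for net in nets if cell in net]
    benefit = sum(1 for net in nets_on_cell
                  if beta_prime(net, dst) == level and beta(net, dst) > 0)
    penalty = sum(1 for net in nets_on_cell
                  if beta_prime(net, src) == level - 1 and beta(net, src) > 0)
    return benefit - penalty


# Hypothetical example: three blocks, five cells, four nets, no locked cells.
blocks = [0, 1, 2]
block_of = {"a": 0, "d": 0, "b": 1, "c": 1, "e": 2}
nets = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"d", "e"}]
locked = set()

gain_1 = level_gain("a", 0, 1, 1, nets, block_of, locked, blocks)
gain_2 = level_gain("a", 0, 1, 2, nets, block_of, locked, blocks)
print(gain_1, gain_2)
```

Moving cell a to block 1 removes the nets {a, b} and {a, c} from the cut-set and adds {a, d} to it, which is what the first-level gain of 1 reports.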

The balance requirement for the block sizes can be satisfied as follows. Let $r_1, \ldots, r_b$
be such that $0 < r_i < 1$ for each $i$ and

$$\sum_{i=1}^{b} r_i = 1 \qquad (4.13)$$

We want the size of $A_i$ to be close to $r_i c$, where $c$ is the total number of cells in the
network. A parameter $w$ is chosen such that $0 < w < \min_{1 \le i \le b}(r_i c)$, and we allow the
following range for the size of $A_i$:

$$r_i c - w \le |A_i| \le r_i c + w \qquad (4.14)$$

That is, a cell move from $A_i$ to $A_j$ is allowed if it preserves the above relationship for $A_i$
and $A_j$.
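
The balance test for a single move can be written as a short check. The function name and the example sizes are assumptions; exact fractions are used only to avoid rounding effects in the comparison.

```python
from fractions import Fraction


def move_allowed(sizes, ratios, w, src, dst):
    """Check whether moving one cell from block src to block dst keeps
    both block sizes inside the range r_i*c - w .. r_i*c + w."""
    c = sum(sizes)  # total number of cells in the network
    new_sizes = list(sizes)
    new_sizes[src] -= 1
    new_sizes[dst] += 1
    return all(ratios[b] * c - w <= new_sizes[b] <= ratios[b] * c + w
               for b in (src, dst))


# Hypothetical three-block example with equal target ratios and w = 2.
ratios = [Fraction(1, 3)] * 3
balanced = move_allowed([10, 10, 10], ratios, 2, 0, 1)
skewed = move_allowed([8, 12, 10], ratios, 2, 0, 1)
print(balanced, skewed)
```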

Given the initial partition, the algorithm improves the partition by iteratively moving
one cell from one block to another in a series of passes. A cell is labeled free if it has not
been moved during that pass. Each pass in turn consists of a series of iterations during
each of which the free cell with the largest gain is moved. During each move, we ensure
that the number of cells in a block does not violate the constraints given by (4.14).
The gain vector, $\Gamma(C)$, whose components are defined in (4.12), is updated constantly as
cells are moved from one block to another. At the end of each pass, the partitions generated
during that pass are examined and the one with the minimum cut-set size is chosen as the
starting partition for the next pass. Passes are performed until no improvement of the
objective value can be obtained.


4.4  Synchronous  Sequential  Circuit  Partitioning 

To help us describe our partitioning algorithm, we introduce the following terminology.
For a synchronous sequential circuit, such as the one shown in Figure 4.3, we define the
following:

An internal latch is a latch whose fan-in and fan-out gates belong to the same
combinational block.

A sequential block consists of a combinational subcircuit and its associated internal
latches.

Boundary latches are latches that act as either a pseudo-PI or a pseudo-PO (but not
both) to a combinational block, i.e., latches whose fan-in and fan-out gates belong
to different combinational blocks.

Figure 4.3 An example illustrating the definition of an internal latch, a sequential
block, and a boundary latch.

A partition of a synchronous sequential circuit $N$ is a partition of the sequential
blocks of $N$ into disjoint groups. A $b$-way partitioning of the network is described by the
$b$-tuple $(G_1, G_2, \ldots, G_b)$ where the $G_i$s are disjoint sets of sequential blocks whose union
is the entire set of blocks in the network. Each $G_i$ is said to be a group of the partition.
After partitioning, boundary latches that lie between groups (that do not belong to any
group) will be set to have constant skews. In other words, we do not have any control
over the skews of those latches during the optimization process.

For a given sequential block $B$, let $L_B$ denote the set of boundary latches incident on
$B$, and for a given boundary latch $L$, let $B_L$ denote the set of sequential blocks that $L$ is
connected to. For each boundary latch $L$, we define the input tightness $T_{in}$, the output
tightness $T_{out}$, and the tightness ratio $r$ as

$T_{in}(L)$ = maximum combinational delay from any boundary latch to $L$ in the
unsized circuit,

$T_{out}(L)$ = maximum combinational delay from $L$ to any boundary latch in the
unsized circuit,

$$r(L) = \begin{cases} T_{in}/T_{out} & \text{if } T_{in} \ge T_{out} \\ T_{out}/T_{in} & \text{if } T_{in} < T_{out} \end{cases} \qquad (4.15)$$

where the adjective “unsized” implies that all gates in the subcircuit are at the minimum
size. The tightness ratio $r(L)$ provides a measure of how advantageous it would be to
provide a skew at $L$. For example, in Figure 4.4, if the input tightness (of path $P_{in}$)
is 3.0, and the output tightness (of path $P_{out}$) is 1.1, retiming the path $P_{in} \cup P_{out}$ would be
of great benefit. Therefore, if the circuit contains an FF whose input and output are
connected to paths with vastly different tightness factors, the two paths should remain
in the same partition.

Figure 4.4 Tightness factor.
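
As a quick numerical illustration of (4.15), the following sketch evaluates the tightness ratio for the delays quoted in the example; the function name is an assumption.

```python
def tightness_ratio(t_in, t_out):
    """Larger tightness divided by the smaller one, so the ratio is at least 1."""
    return t_in / t_out if t_in >= t_out else t_out / t_in


example = tightness_ratio(3.0, 1.1)  # delays from the Figure 4.4 example
print(round(example, 2))
```

A ratio close to 1 indicates that the delays on the two sides of the latch are already balanced, so there is little to gain from adjusting the skew at that latch.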
Figure 4.5 Example showing the definition of merit.


For each pair of blocks $(B_i, B_j)$, define the merit $\mu_{ij}$ as

$$\mu_{ij} = \sum_{L_k :\, B_i \stackrel{L_k}{\longleftrightarrow} B_j} r(L_k) \qquad (4.16)$$

where $B_i \stackrel{L_k}{\longleftrightarrow} B_j$ means latch $L_k$ lies between $B_i$ and $B_j$. The value of $\mu_{ij}$ is defined to
be 0 if $B_i$ and $B_j$ are disjoint (i.e., no boundary latch lies between them). For example, in
Figure 4.5, sequential blocks $B_i$ and $B_j$ are connected through three latches $L_1$, $L_2$, and $L_3$
with tightness ratios $r(L_1)$, $r(L_2)$ and $r(L_3)$, respectively. Then the merit between the two
blocks is $\mu_{ij} = r(L_1) + r(L_2) + r(L_3)$. Physically, $\mu_{ij}$ measures the figure of merit if $B_i$ and
$B_j$ are in the same group. A high $\mu_{ij}$ means that the tightness ratio is high, and hence $B_i$
and $B_j$ should be in the same group.
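
The merit of (4.16) is a sum over the latches between two blocks, which the sketch below accumulates in a dictionary keyed by unordered block pairs. The latch list and its tightness ratios are hypothetical values chosen for the example.

```python
from collections import defaultdict


def merits(latches):
    """Sum tightness ratios per unordered block pair.

    latches: iterable of (block_i, block_j, tightness_ratio) tuples,
    one tuple per boundary latch.
    """
    mu = defaultdict(float)
    for bi, bj, ratio in latches:
        mu[frozenset((bi, bj))] += ratio
    return mu


# Hypothetical data: three latches between B1 and B2, one between B2 and B3.
latches = [("B1", "B2", 2.0), ("B1", "B2", 1.5), ("B1", "B2", 1.0),
           ("B2", "B3", 1.25)]
mu = merits(latches)
mu_12 = mu.get(frozenset(("B1", "B2")), 0.0)
mu_13 = mu.get(frozenset(("B1", "B3")), 0.0)  # no latch in between
print(mu_12, mu_13)
```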

The cost associated with each block $B_i$ is $c_i$, the number of linear programming
constraints required for solving $B_i$. This number can be calculated very efficiently. Assume
that group $G_k$ consists of blocks $B_{k_t}$, $t = 1, \ldots, |G_k|$. Then we define the cost of $G_k$,
$C(G_k) = \sum_{t=1}^{|G_k|} c_{k_t}$, and the merit of $G_k$, $M(G_k) = \sum_{B_i, B_j \in G_k} \mu_{ij}$. We now formulate
the following optimization problem:

$$\max \sum_{k=1}^{N} M(G_k) \quad \text{subject to } C(G_k) \le \alpha \cdot MaxConstraints \qquad (4.17)$$

where $N$ is the number of groups, MaxConstraints is the maximum number of
constraints that one wishes to feed to the LP, and $\alpha > 1$ is introduced so that the
partitioning procedure becomes more flexible since the cost of a group is allowed to exceed
MaxConstraints temporarily. Now that the partitioning problem has been explicitly
defined, we develop a multiple-way synchronous sequential circuit partitioning algorithm
based on the algorithm proposed by Sanchis [42].

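For each group $G_k$ and each boundary latch $L$, define the connection number $\psi_{G_k}(L)$ as

$$\psi_{G_k}(L) = |\{B \mid B \in G_k \text{ and } B \in B_L\}| \qquad (4.18)$$

Since each boundary latch connects exactly two blocks, $\psi_{G_k}(L) \in \{0, 1, 2\}$. In other
words, if $B_i \stackrel{L}{\longleftrightarrow} B_j$, then (a) if $B_i \notin G_k$ and $B_j \notin G_k$, $\psi_{G_k}(L) = 0$ (Figure 4.6(a)), (b)
if $B_i \notin G_k$ and $B_j \in G_k$, or vice versa, $\psi_{G_k}(L) = 1$ (Figure 4.6(b)), and (c) if $B_i \in G_k$
and $B_j \in G_k$, $\psi_{G_k}(L) = 2$ (Figure 4.6(c)).

The gain associated with moving $B$ from $G_i$ to $G_j$ is defined as

$$g(B) = \sum_{L_m \in L_B} r(L_m)\,[\psi_{G_j}(L_m) = 1] \;-\; \sum_{L_n \in L_B} r(L_n)\,[\psi_{G_i}(L_n) = 2] \qquad (4.19)$$

where a bracketed condition equals 1 when it holds and 0 otherwise. The first term of
(4.19) measures the benefit of moving $B$ to $G_j$, while the second measures the penalty of
moving $B$ out of $G_i$.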
Figure 4.6 Example showing calculation of connection numbers.


As an example, consider the scenario illustrated in Figure 4.7. According to the figure,
latch $L_m$ belongs to group $G_i$, while $L_l$ does not belong to any group. Therefore, we are
able to change the skew of latch $L_m$ but not that of latch $L_l$. If we move sequential block
$B$ from $G_i$ to $G_j$, latch $L_l$ would be included in group $G_j$, which means that we are able
to adjust the skew of latch $L_l$ when we apply our optimization procedure on group $G_j$.
On the other hand, $L_m$ now does not belong to any group. Therefore, by moving $B$ from
$G_i$ to $G_j$, we obtain control over latch $L_l$ and the benefit is $r(L_l)$. Also we lose control
of latch $L_m$, which gives a penalty of $r(L_m)$. Finally, latch $L_n$ does not play a role here,
since it does not belong to groups $G_i$ or $G_j$, before or after the move.
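
The gain computation of (4.19) can be sketched as follows, using the connection number of (4.18). The data layout, the block and group names, and the tightness ratios are hypothetical; the configuration mirrors the scenario described above, with one latch gained, one latch lost, and one latch unaffected by the move.

```python
def connection_number(latch_blocks, group_of, group):
    """Number of the two blocks joined by a latch that lie in `group`."""
    return sum(1 for b in latch_blocks if group_of[b] == group)


def move_gain(block, src, dst, latches, group_of):
    """Gain of moving `block` from group `src` to group `dst`.

    latches: dict mapping a latch name to ((block_a, block_b), tightness_ratio).
    """
    gain = 0.0
    for ends, ratio in latches.values():
        if block not in ends:
            continue  # only boundary latches incident on `block` matter
        if connection_number(ends, group_of, dst) == 1:
            gain += ratio  # latch becomes internal to dst after the move
        if connection_number(ends, group_of, src) == 2:
            gain -= ratio  # latch stops being internal to src after the move
    return gain


# Hypothetical configuration: B and B1 in Gi, B2 in Gj, B3 in a third group.
group_of = {"B": "Gi", "B1": "Gi", "B2": "Gj", "B3": "Gk"}
latches = {
    "Ll": (("B", "B2"), 2.5),  # control gained by the move
    "Lm": (("B", "B1"), 1.5),  # control lost by the move
    "Ln": (("B", "B3"), 4.0),  # between groups before and after the move
}
gain = move_gain("B", "Gi", "Gj", latches, group_of)
print(gain)
```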

Before beginning the partitioning procedure, the number of linear programming
constraints, $c_i$, required for each block $B_i$ is calculated using the modified symbolic constraints
propagation algorithm. If $c_i > MaxConstraints$ for some block $B_i$, then it is placed in
a group alone and will not be processed later. Let

$$TotalConstraints = \sum_{i} (c_i \mid c_i \le MaxConstraints) \qquad (4.20)$$

Each remaining block is put into one of the $N'$ groups,

$$N' = \left\lceil \frac{TotalConstraints}{MaxConstraints} \right\rceil \qquad (4.21)$$

such that for each group $k$, $C(G_k) \le MaxConstraints$. This is an integer knapsack
problem, and many heuristic algorithms can be used to obtain an initial partition (see,
for example, [55], Chapter 2). In some cases, it may be impossible to put all blocks into
$N'$ groups without violating the restriction on $C(G_k)$ above; if so, the number of groups
may be larger than that given in (4.21).
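
One possible heuristic for the initial partition is sketched below. The text does not specify which heuristic is used, so the first-fit-decreasing rule, the function name, and the block costs here are all assumptions. The sketch sets oversized blocks aside, opens $N'$ groups as in (4.21), and opens an extra group only when a block fits nowhere.

```python
import math


def initial_partition(costs, max_constraints):
    """First-fit-decreasing sketch. costs: dict mapping block name to c_i."""
    # Blocks that exceed the limit are placed alone and not processed further.
    oversized = [[b] for b, c in costs.items() if c > max_constraints]
    rest = {b: c for b, c in costs.items() if c <= max_constraints}
    total = sum(rest.values())
    n_groups = math.ceil(total / max_constraints)  # N' of (4.21)
    groups = [[] for _ in range(n_groups)]
    loads = [0] * n_groups
    for b in sorted(rest, key=rest.get, reverse=True):
        for k in range(len(groups)):
            if loads[k] + rest[b] <= max_constraints:
                break
        else:
            # No existing group can take the block: exceed N' by one group.
            groups.append([])
            loads.append(0)
            k = len(groups) - 1
        groups[k].append(b)
        loads[k] += rest[b]
    return oversized, groups


# Hypothetical block costs with MaxConstraints = 300.
costs = {"B1": 120, "B2": 90, "B3": 200, "B4": 60, "B5": 400}
oversized, groups = initial_partition(costs, 300)
print(oversized, groups)
```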
After  the  partitioning,  we  apply  the  optimization  algorithm  described  in  Chapter  3
to  each  group.


4.5  Experimental  Results  and  Conclusion 

Table 4.1 gives the experimental results for the partitioning procedure. Since most
of the ISCAS89 circuits consist of only one combinational block, we generated some
synchronous sequential random logic circuits. The numbers of gates and FFs in those
circuits are shown in Table 4.1. For each circuit, we conduct three experiments.

(1) First, we minimize the area using clock skew optimization, but without partitioning.

(2) Secondly, we minimize the circuit area using both clock skew optimization and
partitioning.

(3) For comparison, we minimize the circuit area with neither clock skew optimization nor
partitioning.

From the table, it can be seen that the first approach obtains the best result, as
expected: since it considers all variables at the same time, it provides the best solution.
However, its run time is large. Compared to the first approach, the second approach runs
much faster, at a very slight area penalty. Not surprisingly, the third approach gives the
worst solution. We also note that the introduction of clock skew provides a significantly
faster clock speed for circuit m1337. Although it has not been shown here, the same
result also holds for m1783. For m1783, we also specify several different values of
MaxConstraints. The result shows that as the specified MaxConstraints increases, the number
of groups after partitioning decreases. As the number of groups decreases, the optimized solution
Table 4.1 Performance comparison of the partitioning procedure.

Circuit   # of PIs   # of POs   # of FFs   # of gates   # of blocks
m51            8          8         12          51            5
m144          16          2         18         144            9
m1337         51         53         97        1337           42
m1783         90         54        124        1783           43

               with clock skew opt.                                      without clock skew opt.
               w/o partitioning     with partitioning
Circuit        Area     Run time    MxCnst(1)  N(2)  Area     Run time   Area     Run time
m51            731      1.74s       300              813      1.50s      849      1.29s
m144           1872     6.11s       300        5     1953     3.32s      2410     2.87s
m1337  9.5     12364    135.35s     1500       6     12370    58.96s     13055    47.54s
       9.25    12353    151.34s     1500       6     12356    57.91s
               12685    171.92s     1500       6     12689    60.74s
               13049    186.61s     1500       6     13112    60.94s
m1783          18564    427.14s     300        16    18743    155.07s    21074    140.23s
                                               8     18708    156.55s
                                    2000       6     18572    159.93s

(1) MxCnst = MaxConstraints, the maximum number of constraints.
(2) N, number of groups after partitioning.
using the partitioning procedure improves, while the run time increases only slightly.
When $N = 6$, the solution is comparable to that without using partitioning, and the run
time is still far less than that without using partitioning.

In summary, in this chapter we develop a synchronous sequential circuit partitioning
algorithm. We propose a heuristic measure which is shown to be effective as the objective
function of the partitioning problem. Experimental results show that our partitioning
procedure is very effective in making our optimization algorithm run at a much faster
speed, with no significant degradation in the quality of the solution.
CHAPTER  5


DELAY  AND  AREA  OPTIMIZATION  FOR  PLACEMENT


5.1  Introduction 

For standard-cell based VLSI circuits, optimization for improving timing performance
can be carried out at three levels in the design process: logic synthesis, gate size
selection, and layout. In previous chapters, we have concentrated on optimizing the timing
performance of a VLSI circuit by gate sizing. Thus far, we have not considered
interconnect delay due to wiring capacitances. As today’s VLSI circuits become
larger and device sizes become smaller, the delay of a circuit becomes
dominated by interconnect delays [56]. For example, in an SSI or MSI chip designed in
5 μm nMOS technology, the gate input capacitance (15 fF per minimum-size transistor)
dominates the wiring capacitance (200 fF/mm). A typical transistor with W/L = 10 has
a capacitance of 150 fF, which is equivalent to 0.75 mm of wire. With a typical MSI
die size of a few millimeters on a side, most of the nets will be well below 0.75 mm.
According to the scaling theory, when the transistors are scaled down, the gate input
capacitance is reduced, while the wiring capacitance per unit length is unchanged [56].
Therefore in a 0.5 μm CMOS technology, a minimum-sized transistor has 1.5 fF input
capacitance, which yields a typical transistor (W/L = 10) input capacitance of 15 fF.
This is equivalent to 0.075 mm of wire, a length exceeded by a large number of the nets in a
typical 12 x 12 mm VLSI chip. Consequently, as the devices are scaled down further in
submicron technology, the node capacitances will not go down as much as they did in a
5 μm nMOS MSI chip because of the increased role of the wiring capacitance, and the
delay of a circuit is dominated by interconnect delay.
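
The equivalent wire lengths quoted above follow from dividing the gate input capacitance by the wiring capacitance per unit length. The short sketch below reproduces the arithmetic; the function and constant names are assumptions, and the capacitance values are those given in the text.

```python
WIRE_CAP_FF_PER_MM = 200.0  # wiring capacitance per unit length, from the text


def equivalent_wire_length_mm(gate_cap_ff, wire_cap_ff_per_mm=WIRE_CAP_FF_PER_MM):
    """Length of wire whose capacitance equals the given gate input capacitance."""
    return gate_cap_ff / wire_cap_ff_per_mm


# W/L = 10 transistor: ten times the minimum-size input capacitance.
nmos_5um = equivalent_wire_length_mm(10 * 15.0)   # 5 um nMOS, 15 fF minimum
cmos_05um = equivalent_wire_length_mm(10 * 1.5)   # 0.5 um CMOS, 1.5 fF minimum
print(nmos_5um, cmos_05um)
```

The tenfold drop in equivalent wire length, at an unchanged wiring capacitance per millimeter, is what shifts the dominant load from the gates to the interconnect.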

In this chapter, we extend our work to consider interconnect delay. We concentrate our
work on the gate size selection and placement steps. Layout optimization, also referred to
as timing-driven layout, is concerned with placement and routing. From the overall chip
timing viewpoint, the placement steps are more critical than the routing, which mostly
affects local issues such as noise coupling. For this reason, placement has received
more attention in timing-driven layout.

Recently, there has been extensive research on timing-driven placement [10-13].
Timing-driven placement techniques can be broadly divided into two categories: net-oriented and
path-oriented. In the net-oriented approach, the acceptable delay of each gate (cell) is
calculated and translated into bounds on the delay associated with each net. These
bounds then serve as constraints during the subsequent placement step. In the
path-oriented approach, timing analyses of critical paths are performed dynamically during
the placement step. All paths, or a subset of them, are taken into account implicitly in
the formulation. Since the delay of a circuit is inherently path-oriented, it is expected
that path-based approaches can obtain better solutions [13,57].

A  standard-cell  library  typically  contains  several  versions  of  any  given  gate  type, 
each  of  which  has  a  different  gate  size.  The  gate-sizing  problem  is  that  of  choosing 
optimal  gate  sizes  from  the  library  to  minimize  a  cost  function  (such  as  total  circuit
area),  while  meeting  the  timing  constraints  imposed  on  the  circuit.  This  is  usually 
done  after  technology  mapping,  where  the  logic  function  of  each  gate  is  determined,  and 
before  the  physical  placement  step.  A  drawback  of  such  an  approach  is  that  accurate 
interconnect  wire  lengths  are  not  available  during  the  gate-sizing  procedure.  The  gate  size 
selected  optimally  at  that  stage  may  no  longer  be  optimal  after  the  physical  design  stage 
where  large  interconnect  capacitances  are  introduced  at  the  output  of  each  gate.  To  deal 
with  this  problem,  an  iteration  procedure  is  usually  followed.  After  global  placement, 
the  capacitance  associated  with  each  net  is  extracted,  and  the  gate-sizing  procedure 
is  repeated.  However,  in  such  an  iterative  approach,  the  variation  of  net  capacitance 
between  iterations  may  be  large  and  cause  large  perturbation  in  the  solutions.  Thus,  a 
number  of  iterations  may  be  required,  making  this  approach  quite  expensive.  To  deal 
with  this  problem,  it  is  desirable  that  gate  sizing  and  placement  be  incorporated  into  a 
single  procedure. 

As an illustration, consider the layout placement shown in Figure 5.1(a). Gate D
fans out to gates L1, L2, and L3. Assume that the delay of this circuit under such layout
conditions violates the timing constraints imposed on it. Moreover, D and L2 lie on a long
path whose delay exceeds the timing constraint. Conventional timing-driven placement
would move D, L1, L2 and L3 closer to one another to decrease the delay of gate D, as
shown in Figure 5.1(b). This may increase the wire lengths of other nets attached to
cells D, L1, L2 and L3. But if automatic gate sizing is incorporated with timing-driven
placement, a possible solution would be to replace D with a template with a higher
driving capacity, and L1 with one with a smaller loading capacitance with respect to D.
As a result, some of the cells could be moved to better locations, as shown in Figure
5.1(c). The overall effect is a reduction of the long-path delay, while the increase in area
is kept to a minimum.

Figure 5.1 Advantage of gate sizing together with placement.

In [58], Kim et al. propose an area-timing-testability driven placement algorithm.
Their algorithm consists of a series of iterations. At the beginning of each iteration,
a placement using Timberwolf [59] is done to minimize the total wire length. After
placement, a set of partial scan flip-flops is selected, followed by a gate-sizing step [23].
After gate sizing is done, timing bounds are calculated for each net; then Timberwolf is
called again to obtain an improved layout. In each iteration of the annealing step inside
Timberwolf, cells switch their positions in an attempt to reduce the total wire length
and also to meet the timing bound assigned to each net. Therefore, the algorithm is
net-based, and the gate sizing and placement steps are treated separately.

In this chapter, we propose an algorithm which combines the gate-sizing problem
and timing-driven placement into one procedure. By considering these two problems
together, the value of the interconnect capacitance is known during the selection stage
of the automatic sizing procedure. Therefore, optimal gate sizes can be chosen for each
gate based on layout information, thus reducing the number of iterations required in
the conventional approach. For simplicity, we use a row-based layout style. However, a
more general arrangement can be used. In the following, the terms “gate” and
“cell” are used interchangeably. Both refer to a module in the circuit. In addition, in the
following, we consider combinational circuits only. For a sequential circuit, we can apply
our algorithm to the combinational blocks in the sequential circuit one at a time.

This chapter is organized as follows. Section 5.2 briefly discusses previous work on
timing-driven placement. In Section 5.3 we formulate the task of timing-driven placement
with automatic gate sizing in a single optimization problem. In Section 5.4 we describe a
novel algorithm which performs delay and area optimization for a given compact
placement by means of gate resizing and relocation. Experimental results are provided in
Section 5.5. Finally, we conclude the chapter in Section 5.6.

5.2  Previous  Work 

For many years, timing-driven layout techniques were net-oriented. The timing
constraints derived from higher levels were translated into bounds on delay associated with
each net, and timing-driven placement and routing were used to synthesize a layout
satisfying those constraints. However, timing is not associated with the nets but with the
signal flows along paths which are combinations of nets in the circuit. Therefore, instead
of satisfying individual net delays, the constraints on the sum of delays of all of the nets
constituting a path must be satisfied.
To  take  into  consideration  more  accurate  timing  behavior  and  achieve  globally  better
solutions,  timing  analysis  of  critical  paths  must  be  performed  dynamically  during  the 
placement  procedure.  Such  a  technique  is  proposed  by  Jackson  and  Kuh  in  [10]  where 
a  sequence  of  linear  programming  steps  is  used  to  determine  the  cell  placement  in  a 
hierarchical  approach.  The  delay  behavior  is  modeled  in  a  path-oriented  manner  and 
considers  intercell  delays  as  well  as  interconnect  and  pin  capacitances. 

Sutanthavibul  and  Shragowitz  [60]  proposed  a  hierarchical  constructive  placement 
algorithm  with  look-ahead  and  adaptive  placement  capabilities.  The  delay  functions  are 
computed  based  upon  the  net  geometry,  capacitance  per  unit  wire  length,  and  the  net 
loading,  to  arrive  at  the  path  delay  values. 

Donath  et  al.  [61]  introduced  an  approach  in  which  the  timing  is  evaluated  together 
with  routability  in  the  global  placement  step.  The  parameterized  delay  equations  are  used 
in the path analysis. During the placement of cells on the critical paths, fast incremental
timing  analysis  is  performed  to  evaluate  the  feasibility  of  each  move.  A  complete  timing 
analysis  is  done  after  each  major  step. 

Srinivasan  et  al.  [11]  proposed  an  approach  based  on  Lagrangian  Relaxation.  They 
observed that only a small subset of timing requirements is active as constraints at one
time,  thus  the  problem  of  a  large  number  of  paths  can  be  effectively  avoided.  They 
represented  timing  requirements  by  a  set  of  linear  inequalities.  When  the  corresponding 
constrained  optimization  problem  is  turned  into  a  Lagrangian,  these  linear  inequalities 
make  the  Lagrangian  nondifferentiable.  The  subgradient  method  was  used  to  update 
Lagrange  multipliers  on  the  nondifferentiable  Lagrangian. 

Most recently, Hamada et al. [13] proposed an algorithm which also transforms the
placement with timing constraints into a Lagrangian problem. A primal-dual approach
is then used to find the optimal relative module locations. In each primal-dual iteration,
the primal problem is solved by a piecewise linear resistive network method, while the
dual process is used to update the Lagrange multipliers by using the Newton method.


5.3  Timing-Driven  Placement  with  Gate  Sizing 

Typically, a path-based timing-driven placement algorithm formulates the placement
problem as an optimization problem, with both timing requirements and physical placement
requirements as constraints. The constraints are usually linear. The objective
function can be either a linear function or a quadratic function of the cell coordinates.
A quadratic objective function allows efficient quadratic programming techniques to be
used, so the problem can be solved relatively fast. However, in [62], it was observed that
a linear objective function tends to reflect the actual wiring demands more accurately
than the quadratic objective function. Therefore, we choose to use a linear objective
function in our approach.

A circuit can be modeled as a set of $M$ gates (cells), $\mathcal{G} = \{g_1, \ldots, g_M\}$, interconnected
by a set of $N$ nets, $\mathcal{N} = \{n_1, \ldots, n_N\}$, that attach to the cells at pins. For the sake of
simplicity, we assume that all gates in the circuit are of single output. Therefore, net
$n_i$ is associated with gate $g_i$. Hence, the same index $i$ can refer to both a gate
and a net. We also assume that all pins are located in the centers of cells. Therefore,
the physical location of a cell $i$ on the chip is represented by $(x_i, y_i)$, where $x_i$ ($y_i$) is the
x (y) coordinate of cell $i$. The positions of the I/O pads are fixed and located on the
perimeter of the chip. These constraints act as the boundary conditions.

Figure 5.2 Approximating wire length using bounding box method for 2- and 3-pin
nets.


Figure  5.3  Approximating  wire  length  using  bounding  box  method  for  4-  and  5-pin 
nets. 

There  are  three  categories  of  constraints  in  our  LP  formulation,  namely,  physical, 
timing,  and  sizing  constraints. 

5.3.1  Physical  constraints 

We approximate the wire length of an individual net by the half-perimeter of the
smallest rectangle enclosing the pins of the net [10]. This approximation is the same as
the rectilinear, minimal Steiner tree length for two- or three-pin nets (Figure 5.2). For
four- and five-pin nets, the approximation differs from the Steiner tree length by at most
the width of the bounding box [63] (Figure 5.3). The bounding box for net $i$ is denoted by four
parameters, the northernmost ($\eta_i$), southernmost ($\sigma_i$), easternmost ($\epsilon_i$), and westernmost
($\omega_i$) extents of the pins of the net (Figure 5.4). Mathematically, the bounding box
constraints can be expressed as follows:

$$
\begin{aligned}
\epsilon_i &\ge x_{ij}, \\
\omega_i &\le x_{ij}, \\
\eta_i &\ge y_{ij}, \\
\sigma_i &\le y_{ij}, \qquad \forall\; 1 \le j \le p_i
\end{aligned}
\tag{5.1}
$$

where $p_i$ is the number of pins associated with net $i$, $j$ is a pin of net $i$, and
$(x_{ij}, y_{ij})$ is the location of that pin.

Figure 5.4 Approximating wire length using bounding box method.

Let $C_h$ and $C_v$ denote the unit length wire capacitance in horizontal and vertical
layers, respectively. Then the interconnect capacitance, $C_i$, of net $i$ can be estimated as

$$C_i = C_h(\epsilon_i - \omega_i) + C_v(\eta_i - \sigma_i) \tag{5.2}$$

Similarly, the length of net $i$, $l_i$, is

$$l_i = (\epsilon_i - \omega_i) + (\eta_i - \sigma_i) \tag{5.3}$$

Therefore, the total wire length of the layout is

$$\sum_{i=1}^{N} l_i \tag{5.4}$$
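As a small illustration of the bounding-box estimates above, the following Python sketch computes the four extents of a net from its pin coordinates and derives the length and capacitance estimates from them. The function names, the sample pin coordinates, and the unit capacitances are assumptions made for this example and are not taken from the implementation described in this chapter.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(pins: List[Point]) -> Tuple[float, float, float, float]:
    """Return the (east, west, north, south) extents of a net's pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return max(xs), min(xs), max(ys), min(ys)


def net_length(pins: List[Point]) -> float:
    """Half-perimeter of the bounding box, the wire-length estimate of a net."""
    east, west, north, south = bounding_box(pins)
    return (east - west) + (north - south)


def net_capacitance(pins: List[Point], c_h: float, c_v: float) -> float:
    """Wire capacitance with a separate unit capacitance per routing direction."""
    east, west, north, south = bounding_box(pins)
    return c_h * (east - west) + c_v * (north - south)


# Hypothetical nets: a two-pin net and a three-pin net.
nets = {
    "n1": [(0.0, 0.0), (3.0, 4.0)],
    "n2": [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0)],
}
total_length = sum(net_length(pins) for pins in nets.values())
cap_n1 = net_capacitance(nets["n1"], c_h=0.2, c_v=0.3)
```

Because every extent is a maximum or minimum over pin coordinates, each can be bounded by linear inequalities, which is what allows the estimate to appear inside a linear program.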

5.3.2  Timing  and  sizing  constraints 

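Consider a single-output gate $i$ with $fi(i)$ inputs, and let gate $i$ fan out to $fo(i)$ gates.
The worst-case signal arrival time at the output of gate $i$, $m_i$, can be expressed as

$$m_i \ge m_{ij} + d_i, \qquad 1 \le j \le fi(i) \tag{5.5}$$

where $m_{ij}$ is the worst-case arrival time at the output of the $j$th fan-in gate of gate $i$
and $d_i$ is the delay of gate $i$.

Now the delay of gate $g_i$ can be attributed to the loading capacitance of its fan-out
gates, plus the wire capacitance of its fan-out net $n_i$. Let $CL_{ij}$ represent the loading
capacitance of gate $g_{ij}$ with respect to gate $g_i$. Then the delay of gate $g_i$ is

$$
\begin{aligned}
d_i &= \text{[right-hand side illegible in the source]} && (5.6) \\
    &= \frac{\rho}{w_i} \times \Big( C_i + \sum_{j=1}^{fo(i)} CL_{ij} \Big) + \tau_1 \cdot w_i + \tau_2 && (5.7) \\
    &= \frac{\rho}{w_i} \times \Big\{ C_h(\epsilon_i - \omega_i) + C_v(\eta_i - \sigma_i) + \sum_{j=1}^{fo(i)} (\alpha_{ij} \cdot w_{ij} + \beta_{ij}) \Big\} + \tau_1 \cdot w_i + \tau_2 && (5.8)
\end{aligned}
$$

where $\alpha_{ij}$ and $\beta_{ij}$ are related to the (transistor) gate terminal area capacitance and
(transistor) gate terminal perimeter capacitance of the transistor of cell $j$ to which cell $i$
fans out [17].

As in Section 2.3.1, this is a sum of functions of the form $y/w$. Therefore, it can be
approximated by a piecewise linear function.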
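To make the dependence of the gate delay on sizes and placement concrete, the sketch below evaluates a delay of the form (rho / w) × (wire capacitance + fan-out input capacitances) + tau1 × w + tau2, with each fan-out input capacitance modeled as alpha × (fan-out size) + beta. This form is one reading of the delay model in this section, and the function name and all numeric values are invented for illustration.

```python
from typing import Sequence


def gate_delay(
    size: float,
    wire_cap: float,
    fanout_sizes: Sequence[float],
    alpha: Sequence[float],
    beta: Sequence[float],
    rho: float,
    tau1: float,
    tau2: float,
) -> float:
    """Delay of one gate driving a wire plus the inputs of its fan-out gates."""
    # Input capacitance of each fan-out gate grows linearly with its size.
    fanout_cap = sum(a * w + b for a, w, b in zip(alpha, fanout_sizes, beta))
    # Driving resistance is inversely proportional to the size of the gate.
    return rho / size * (wire_cap + fanout_cap) + tau1 * size + tau2


# Hypothetical gate with two fan-outs; only the driver size differs below.
params = dict(wire_cap=1.8, fanout_sizes=[1.0, 2.0], alpha=[0.5, 0.5],
              beta=[0.1, 0.1], rho=4.0, tau1=0.05, tau2=0.2)
d_small = gate_delay(size=1.0, **params)
d_large = gate_delay(size=2.0, **params)
```

Doubling the driver size halves the load-dependent term, while a shorter net lowers `wire_cap`; these are the two handles that resizing and relocation act on.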
5.3.3 Objective function

The objective function of our optimization problem can be formulated as

$$\min \Big( \sum_{i=1}^{M} \gamma_i \cdot w_i + T \cdot \sum_{i=1}^{N} l_i \Big) \tag{5.9}$$

where $l_i$ is the length of net $i$ and $T$ is a constant. In general, we may want to set $T$ equal
to the sum of the width of an interconnect wire and the minimum distance between two
adjacent wires. That way, $T$ is the minimum width a wire occupies on the chip.

The  objective  function  in  this  formulation  represents  two  important  quantities  to  be 
minimized  in  physical  design.  The  first  term  is  the  total  area  of  the  cells.  The  second 
term  represents  the  total  area  taken  by  the  interconnect  wires. 
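The two terms of the objective can be evaluated directly for a candidate solution. The following sketch assumes that the area of a cell is a per-cell coefficient times its size and that the wire area is the wire pitch times the total estimated wire length; the function name and the sample numbers are hypothetical.

```python
from typing import Sequence


def layout_cost(
    sizes: Sequence[float],
    area_coeffs: Sequence[float],
    net_lengths: Sequence[float],
    wire_pitch: float,
) -> float:
    """Total cell area plus the area occupied by interconnect wires."""
    cell_area = sum(g * w for g, w in zip(area_coeffs, sizes))
    wire_area = wire_pitch * sum(net_lengths)
    return cell_area + wire_area


# Hypothetical values: the pitch is wire width plus minimum wire spacing.
wire_width, wire_spacing = 0.3, 0.2
cost = layout_cost(
    sizes=[1.0, 2.0, 4.0],
    area_coeffs=[10.0, 10.0, 12.0],
    net_lengths=[7.0, 7.0],
    wire_pitch=wire_width + wire_spacing,
)
```

Both terms are linear in the gate sizes and in the bounding-box extents, so the cost can serve as the objective of a linear program.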


5.3.4  Slot  constraints 

For most placement algorithms using mathematical programming techniques, the solutions
in general would yield a placement which could have many cell overlaps. Therefore,
placement is usually alternated with partitioning steps that generate constraints for
the next step. During each step, the following constraint is introduced for each region:

$$\sum_{j \in M_i} w_j \cdot x_j = r_i^x \cdot \sum_{j \in M_i} w_j \tag{5.10}$$

where $r_i^x$ ($r_i^y$) is the x (y) coordinate of the center of the $i$th region, $M_i$, and $|M_i|$ is the
number of cells in that region.

Equation (5.10) forces the center of gravity of all cells in the region to be equal to the
center of the region. Therefore, the cells are distributed better over the whole placement
region.
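The center-of-gravity condition can be checked for a given region as sketched below. Weighting each cell by its size is an assumption of this example, as are the function names and the sample coordinates.

```python
from typing import Sequence, Tuple


def center_of_gravity(
    positions: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Weighted centroid of the cells assigned to one region."""
    total = sum(weights)
    cx = sum(w * x for w, (x, _) in zip(weights, positions)) / total
    cy = sum(w * y for w, (_, y) in zip(weights, positions)) / total
    return cx, cy


def satisfies_slot_constraint(positions, weights, region_center, tol=1e-9):
    """True when the centroid of the cells coincides with the region center."""
    cx, cy = center_of_gravity(positions, weights)
    return abs(cx - region_center[0]) <= tol and abs(cy - region_center[1]) <= tol


# Hypothetical region centered at (5, 5) holding three cells.
cells = [(3.0, 4.0), (6.0, 8.0), (8.0, 4.0)]
sizes = [2.0, 1.0, 1.0]
centered = satisfies_slot_constraint(cells, sizes, (5.0, 5.0))
```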

5.3.5  Final  LP 

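After introducing the constraints and objective function, we are in a position to
formulate the following linear program:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{M} \gamma_i \cdot w_i + T \cdot \sum_{i=1}^{N} l_i \\
\text{subject to} \quad & \text{for all gates } i = 1, \ldots, M: \\
& m_j + d_i \le m_i && \forall\, j \in Fanin(i) \\
& m_i \le \text{(specified delay)} && \forall \text{ gates } i \text{ at PO} \\
& d_i \ge D(w_i, w_{i,1}, \ldots, w_{i,fo(i)}, \epsilon_i, \omega_i, \eta_i, \sigma_i) \\
& w_i \ge Minsize(i) \\
& w_i \le Maxsize(i) \\
& \epsilon_i \ge x_{ij} && \forall\, 1 \le j \le p_i \\
& \omega_i \le x_{ij} && \forall\, 1 \le j \le p_i \\
& \eta_i \ge y_{ij} && \forall\, 1 \le j \le p_i \\
& \sigma_i \le y_{ij} && \forall\, 1 \le j \le p_i
\end{aligned}
\tag{5.11}
$$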
The above is a linear program in the variables $m_i$, $d_i$, $w_i$, $x_i$, $y_i$, $\epsilon_i$, $\omega_i$, $\eta_i$, and $\sigma_i$.

5.4 A Unified Algorithm for Adjusting Placement and Gate Sizing

Although it is possible to solve (5.11) directly, the execution time may be excessively
large due to the large number of variables and constraints. In this section, we present an
algorithm which tackles this problem indirectly, and thus reduces execution time.

Notice that timing-driven placement is needed because, in general, gate sizes are
selected before the placement procedure, and gate sizes are fixed during placement. This
imposes a restriction on the placement tool in the search for a good placement with
minimum wire length. On the other hand, although placement tools such as those in
[59, 64] can obtain a placement with minimal wire length, the delay of the circuit based
on that placement may exceed timing constraints. Recently, it has been suggested that
a compact placement which violates the timing constraint could be made to satisfy the
delay bound by adjusting the sizes of some gates, without altering the placement topology
(Chapter 16, [65]). In the following, we propose an algorithm which combines gate resizing
and relocation to satisfy timing constraints while keeping the total circuit area
(including cell area and wire length) to a minimum, for a given compact placement.

First, all of the gates in the circuit are set to their minimum size. A compact placement
is obtained with the objective of minimizing the total wire length. This can be done by
using existing placement packages (e.g., Timberwolf [59] or Gordian [64]). After that, the
wiring capacitance associated with the output of each gate is calculated. Based on this
information, together with the circuit structure, optimal gate sizes are selected using the
gate size optimization algorithm described in Sections 2.4 and 2.5. In general, some gates
will be selected to have a larger size. This may cause overlap among cells. This problem
can be solved by shifting cells to avoid overlap. In general, however, the perturbation on
the delays of individual gates may cause the circuit delay to exceed the delay constraints.
If that does happen, a conventional approach would repeat the gate-sizing procedure
to guarantee that the circuit delay of that specific layout is below the delay constraint.
Usually, the gate sizing and placement procedures have to be repeated a few times before
a final solution is reached.

ALGORITHM Resizing and Relocation()
    do initial placement;
    do initial gate sizing for all cells in the circuit;
    while ( timing constraints are not satisfied ) {
        select gates belonging to type 1, 2, and 3;
        formulate LP (Eq. (5.11)) for these gates;
            (remaining cells serve as boundary conditions)
        solve the LP;
        use mapping algorithm (Sec. 2.4) to obtain permissible size
            for each gate;
        adjust cell locations to avoid overlap;
    }
    report final placement;

Figure 5.5 An outline of the Resizing and Relocation algorithm.

Our algorithm, in contrast, does not repeat the gate-sizing procedure all over again.
Rather, once the algorithm detects that the delay constraint of the circuit is violated, a number
of gates, as described below, are selected. These gates will be resized and/or moved
to different locations to satisfy timing constraints as well as to minimize total circuit area
(including cell area and wire length). The outline of our algorithm is shown in Figure 5.5.
In the following, we describe how we select only a small portion of gates in the circuit
for resizing and relocation, and how cell resizing and relocation can be combined into
one formulation.

In addition to the worst-case signal arrival time, $m_i$, for each gate $i$ in the circuit
we introduce the required signal arrival time, $r_i$. The required signal arrival time is the
latest time by which a signal has to arrive at the output of gate $i$ to make the delay at
the POs less than the specified delay. The required signal arrival time is defined to be

$$
r_i = \begin{cases}
\text{(specified delay)}, & \text{if gate } i \text{ at PO} \\
\min\{ r_j - d_j \mid \forall\, j \in Fanout(i) \}, & \text{otherwise}
\end{cases}
\tag{5.12}
$$

For each gate $i$, we also define a slack $s_i$, where

$$s_i = r_i - m_i \tag{5.13}$$

Definition 5.1 An active gate $i$ is a gate with $s_i < 0$. The set of all active gates is
denoted by $C$.

Definition 5.2 The timing of a circuit layout is said to be satisfied if and only if $C$ is
empty, i.e., $s_i \ge 0$ for every gate $i$ in the circuit.

Definition 5.3 A critical path is a path in which all of the gates along the path have
slack values less than or equal to zero.
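The arrival times, required times, and slacks can be computed with one forward and one backward pass over the circuit. The sketch below is a minimal illustration under several assumptions that are not stated in the text: primary inputs arrive at time 0, every gate that is not at a primary output has at least one fan-out, and the required time of a gate is the tightest (smallest) requirement imposed by its fan-outs. The function name and the five-gate circuit are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set


def timing_slacks(
    fanin: Dict[str, List[str]],
    delay: Dict[str, float],
    primary_outputs: Set[str],
    delay_bound: float,
):
    """Return (arrival, required, slack) dictionaries keyed by gate name."""
    order = list(TopologicalSorter(fanin).static_order())  # fan-ins come first
    fanout: Dict[str, List[str]] = {g: [] for g in order}
    for gate, preds in fanin.items():
        for p in preds:
            fanout[p].append(gate)

    # Forward pass: latest fan-in arrival plus the gate's own delay.
    arrival: Dict[str, float] = {}
    for g in order:
        latest_input = max((arrival[p] for p in fanin.get(g, [])), default=0.0)
        arrival[g] = latest_input + delay[g]

    # Backward pass: a gate must be early enough for every one of its fan-outs.
    required: Dict[str, float] = {}
    for g in reversed(order):
        if g in primary_outputs:
            required[g] = delay_bound
        else:
            required[g] = min(required[s] - delay[s] for s in fanout[g])

    slack = {g: required[g] - arrival[g] for g in order}
    return arrival, required, slack


# Hypothetical five-gate circuit; gates "d" and "e" drive primary outputs.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["b"]}
delay = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 2.0, "e": 1.0}
arrival, required, slack = timing_slacks(fanin, delay, {"d", "e"}, delay_bound=6.0)
active = {g for g, s in slack.items() if s < 0}
```

In this example the path through gates a, c, and d misses the bound of 6 by one time unit, so those three gates are active, while gate b has zero slack and gate e has ample slack.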

Our objective is to satisfy specified delay bounds and to keep the total circuit area
to a minimum. This can be achieved in two ways. The first one is to resize gates. For
example, for those gates lying on critical paths, we may replace a gate with a template
with a higher driving capacity to reduce the delay of that gate. Alternatively, we may
replace a gate with one with smaller input capacitance, thus reducing the delay of its
driving gates. The second one is to move some cells to new locations, so as to reduce the
interconnect wiring capacitance attached to those gates lying on critical paths. This will
also reduce the delays of these gates.

The  unified  optimization  algorithm  begins  by  calculating  the  slack  of  each  gate.  Then 
three  types  of  gates  are  selected  for  improvement. 

(1)  The  first  type  is  active  gates,  which  are  gates  with  negative  slack.  These  gates  will 
be  allowed  to  change  their  sizes  as  well  as  be  free  to  move  to  new  locations.  Since 
active  gates  are  those  with  worst-case  signal  arrival  times  later  than  the  required 
signal  arrival  time,  it  is  most  important  to  adjust  the  delays  of  active  gates  such 
that  the  circuit  delay  is  less  than  the  specified  timing  constraints.  However,  in 
addition  to  adjusting  the  size  and  location  of  an  active  gate,  the  following  two 
types  of  cells  should  also  be  included  in  the  linear  program. 

(2) The second type involves those gates with nonnegative slacks less than a small
specified value, S. During this phase some gates will change their size, and some
others will be moved to new locations; as a result, output load capacitances of
certain gates will increase (while those of others will decrease). For those gates with
large slacks, it is likely that such changes, although they will increase their delay,
will not make their slack negative. Therefore, those gates with larger slacks are
likely to remain nonactive. However, for those gates with small nonnegative slacks,
it is possible that such a delay increase will make their slacks become negative.
Therefore, it is more advantageous to include those gates with small nonnegative
slacks in the formulation to avoid additional iterations.

(3) The third type of gate includes those that are directly connected to the outputs of
active gates. Remember that active gates are those which violate timing constraints.
Therefore, reducing the delays of these gates, besides changing their sizes, can also
be accomplished by reducing their output load capacitances. This can be done by
either moving or reducing the sizes of the active gates' fan-out cells.
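The three selection rules can be written down compactly once the slack of every gate is known. The following sketch is illustrative only: the function name, the `threshold` parameter (standing for the small specified slack value of type 2), and the sample data are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple


def select_gates(
    slack: Dict[str, float],
    fanout: Dict[str, Iterable[str]],
    threshold: float,
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the gates of type 1, type 2, and type 3."""
    # Type 1: active gates, whose slack is negative.
    active = {g for g, s in slack.items() if s < 0}
    # Type 2: gates whose slack is nonnegative but below the threshold.
    near_critical = {g for g, s in slack.items() if 0 <= s < threshold}
    # Type 3: gates driven directly by an active gate.
    driven = {f for g in active for f in fanout.get(g, ())}
    return active, near_critical, driven


# Hypothetical slacks and connectivity.
slack = {"a": -1.0, "b": 0.0, "c": -1.0, "d": -1.0, "e": 4.0, "f": 3.0}
fanout = {"a": ["c", "f"], "b": ["c", "e"], "c": ["d"]}
type1, type2, type3 = select_gates(slack, fanout, threshold=0.5)
selected = type1 | type2 | type3
```

Gate f is selected even though its slack is large, because it loads the active gate a; gate e is left out and keeps its size and location.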

Gates belonging to types 1, 2, and 3 are put into the linear program, (5.11), and
a new solution is obtained by solving it. In principle, to obtain a better solution, it is
necessary to include all three types of gates in the linear program. In practice, however,
to maintain the efficiency of the program, it is necessary to limit the number of gates to be
included. In addition, since many gates’ locations are fixed, they can serve as boundary
conditions for physical constraints. Therefore, the gravity centering constraints, (5.10),
are not needed. Furthermore, to avoid drastically changing the solution, each selected
gate is allowed to change to its nearest larger or smaller size only.
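One way to read the last restriction is as a pair of per-gate bounds on the size variable: for a gate currently using a given template, the lower and upper bounds are the adjacent entries in the sorted list of available sizes. This interpretation, the function name, and the size list below are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def size_bounds(library: Sequence[float], current: float) -> Tuple[float, float]:
    """Smallest and largest size a gate may take in one iteration."""
    sizes = sorted(library)
    k = sizes.index(current)  # raises ValueError if current is not a library size
    lower = sizes[max(k - 1, 0)]
    upper = sizes[min(k + 1, len(sizes) - 1)]
    return lower, upper


library = [1.0, 2.0, 4.0, 8.0]  # hypothetical template sizes for one gate type
bounds_mid = size_bounds(library, 2.0)
```

A gate already at the smallest or largest template keeps that size as one of its bounds.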

The solution of such a formulated linear program gives a new size and a new position
for each selected cell. The mapping algorithm described in Section 2.4 is used to obtain
the permissible size for each gate. Since many cells are moved to new locations, and some
of them are replaced with templates of different sizes, there may be overlap among some
cells. Therefore, it is necessary to move cells into (slightly) different locations to avoid
overlap.
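The text does not specify how overlaps are removed. As one simple possibility, the sketch below processes the cells of a single row from left to right and pushes each cell to the right just far enough to clear its left neighbor; it ignores the row boundary and is not claimed to be the procedure used in PRECISE. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def remove_overlap(cells: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """Shift cells of one row to the right until no two of them overlap.

    Each cell is (name, left_edge, width); the result maps name to new left edge.
    """
    placed: Dict[str, float] = {}
    next_free = float("-inf")  # first x coordinate not covered by a placed cell
    for name, left, width in sorted(cells, key=lambda c: c[1]):
        left = max(left, next_free)
        placed[name] = left
        next_free = left + width
    return placed


# Hypothetical row in which cell "b" overlaps cell "a" after resizing.
row = [("a", 0.0, 2.0), ("b", 1.0, 3.0), ("c", 6.0, 1.0)]
legal = remove_overlap(row)
```

Only cell b moves in this example, and only by one unit, which matches the intent of perturbing the LP solution as little as possible.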

If necessary, the above procedure is repeated until the delay constraints are all satisfied.
However, according to our experience, only one iteration is needed in most cases.
Also, since only a relatively small number of gates are selected to be resized and relocated,
the execution time for each iteration is relatively small (compared to the time
needed to resize all gates).

Table 5.1 Experimental results of PRECISE.

5.5 Experimental Results

The above algorithms have been implemented in C in the program PRECISE (PeRformancE-driven
plaCement with automatic gate SizE optimization) on a Sun Sparc 10
Station.

The  experimental  results  of  the  program  PRECISE,  which  implements  the  unified 
placement  improvement  and  gate  resizing  algorithms,  are  summarized  in  Table  5.1. 


To show the effectiveness of our algorithm, we intentionally adjust the value of the interconnect
wiring capacitance per unit length, such that interconnect delay accounts for
about 30% of the total delay of each circuit. At present, we use Fiduccia’s min-cut
partitioning algorithm [47] to obtain a compact placement. The partitioning algorithm
recursively divides cells into two partitions so that the number of nets that cross the partition
boundaries is minimized, until a small number of cells are left in each partition, and
then cells are placed in their final locations. It has been observed that partitioning-based
placement tends to spread the wiring across the layout surface and thus produces very
routable placement (Chapter 4, [3]). More compact placement can be obtained by using
other algorithms (e.g., [59, 64]). For comparison, we also perform placement and gate
sizing based on the purely iterative approach. That is, the two procedures of placement
adjustment and gate resizing are executed separately and are included in an iteration
loop. The experimental results show that PRECISE is able to obtain better solutions
than the conventional iterative approach. Moreover, for very tight timing bounds, the
conventional approach fails to obtain solutions at all. This is because cell locations are
fixed in the conventional approach, and excessively large capacitances may have been
introduced at the output of some gates on critical paths. On the other hand, in addition
to resizing cells, PRECISE also moves cells to different locations to reduce large
wiring capacitance. Therefore, it is able to obtain solutions even for tight delay bounds.
Furthermore, instead of trying to resize all cells, PRECISE resizes only a small
portion of cells when timing bounds are violated; as a result, its execution time is shorter
in general.

5.6 Conclusion


To date, gate sizing and placement are treated separately in different steps during the
circuit design process. Such an approach has not caused much trouble because interconnect
delay takes up only a small amount of the circuit delay in a chip fabricated using
today’s VLSI technology. However, as the devices are scaled down in deep submicron
technology, the delay of a circuit becomes dominated by interconnect delay. Therefore,
it becomes more and more important to combine gate sizing and placement into one
procedure.

In this chapter, for the first time, the gate-sizing problem is combined with placement
in one formulation. Although the execution time for the combined problem may be excessively
large, we propose an indirect approach to fully utilize some special properties of
the formulation to develop a novel algorithm which performs delay and area optimization
for a given compact placement, by resizing and relocating cells in the circuit layout. The
experimental results are very encouraging.

CHAPTER 6
CONCLUSIONS


In this thesis, an efficient algorithm is presented to minimize the area taken by cells
in standard-cell designs under timing constraints. Experimental results show that our
approach can obtain a near-optimal solution (compared to simulated annealing) in a
reasonable amount of time, even for very tight delay constraints.

For synchronous sequential circuits, a unified approach to minimizing circuit area and
optimizing clock skews is presented. Traditionally, the circuit area of a sequential circuit
is minimized one combinational subcircuit at a time. Our experiments have shown that
this may lead to highly suboptimal solutions in some cases. We formulate the discrete
gate-sizing optimization as a linear program, which enables us to integrate the equations
with clock skew optimization constraints, taking a more global view of the problem.
Experimental results show that this approach not only reduces total circuit area, but
also gives much faster operational clock speed. For large sequential circuits, we also
present a partitioning procedure. Experiments show that our partitioning procedure is
very effective. Using our partitioning procedure, our optimization algorithm is able to
run at a much faster speed, with no significant degradation in the quality of the solution.

To date, most research on performance-driven placement assumes that gate sizes are
selected before the placement stage. This imposes a restriction on the placement tool
in searching for a good placement with minimum wire length. Recently, it has been
suggested that a compact placement which violates the timing constraint could be made
to satisfy the delay bound by adjusting the sizes of some gates, without altering the
placement topology [65]. In this thesis, we have shown that such an approach may lead
to solutions of inferior quality. Instead, by considering resizing and moving the locations
of some gates in a unified optimization procedure, we are able to obtain better solutions,
with smaller execution times than the conventional iterative method. For the first time,
the gate-sizing problem is combined with placement in one formulation. Although the
execution time for the combined problem may be excessively large, we propose an indirect
approach to fully utilize some special properties of the formulation to develop a novel
algorithm which performs delay and area optimization for a given compact placement,
by resizing and relocating cells in the circuit layout.


6.1  Future  Work 

In a combinational circuit, there are some paths, called false paths, that can never be
excited by any combination at the primary inputs. Hence, these paths can never be critical [66-68]. The
presence of false paths in a circuit causes some gates to be sized unnecessarily, since the
optimizer tries to reduce the delay along a path that can never be critical. This may lead
to a suboptimal solution to the gate-sizing problem. In other words, although a physical
level performance optimizer must certify that the delay of the longest sensitizable paths
after optimization is not longer than the specified delay, long paths are allowed to exist
in the optimized circuit if they are not sensitizable. As demonstrated in [69], most long
paths in a complex circuit are actually false. As a result, to optimize the performance
of a large circuit at the physical level, it is important to consider the false path problem
in conjunction with gate size optimization.

Retiming [70] has been shown to be an effective technique to optimize the performance
of synchronous sequential circuits. Retiming is an operation on a network whereby registers
move across logic blocks in order to minimize the clock cycle or the number of
registers while maintaining the behavior of the circuits.

The retiming technique considers only the sequential elements of the circuit; it assumes
that the combinational logic structure is fixed. In [71], a set of logic synthesis
operations has been combined with retiming to optimize sequential circuits for area
and clock period. Peripheral retiming has been used to optimize the performance of
pipelined circuits using combinational delay optimization techniques [72]. While these
formulations do exist, they are not directly relevant to our work since we assume that we
begin our optimization at the end of the logic synthesis stage. It has been shown that
a retiming algorithm can be formulated as a mixed integer linear program [70]. This
approach, however, may not be applied directly to our problem, since the computational
complexity involved would be prohibitive. We propose to seek methods of carrying out
retiming through a series of inexpensive local optimizations. For example, as shown in
Figure 6.1 [9], changing the clock arrival time at a flip-flop is equivalent
to changing the delay specifications on the combinational subcircuits to which that
flip-flop is connected. The net effect of this is similar to moving the flip-flop across
combinational logic module boundaries. Therefore the solution to the clock skew optimization
problem could also be interpreted to be a new set of timing specifications for
each combinational subcircuit, which may be enforced either by permitting a clock skew,
or through retiming.

Figure 6.1 Retiming and clock delay transformation.

Methods for using retiming in combination with clock skew for achieving this change
in the timing specification should be explored, thus obtaining better solutions for gate-size
optimization.

In a real chip, the delay between two successive logic gates is composed of three
elements: (1) intrinsic delay due to switching a gate on/off, (2) delays due to charging
fanout and load capacitance, and (3) delay due to distributed RC interconnection. The
scaling rule suggests that the interconnect delay will be dominant for the circuits with
larger chip size and smaller geometry. The effect can be quite significant for submicron
circuits since the interconnect delay grows superlinearly with the scaling factor and the
chip dimension. As the VLSI fabrication technology reaches submicron device dimensions
and gigahertz frequencies, it is necessary to consider such interconnect delays. The
timing-driven placement improvement algorithm we propose in Chapter 5 uses a lumped
RC model. However, wires on scaled-down ICs have significant resistance and should
be analyzed as distributed RC lines. In summary, timing-driven placement and routing
remain the challenges of today’s submicron device technology.
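The difference between the two wire models can be illustrated with the standard Elmore delay of an RC ladder. This model is not part of the algorithm of Chapter 5 and is used here only as an illustration; the function name and the wire values are assumptions. With one segment the wire behaves as a lumped RC, and as the number of segments grows the ladder approaches a distributed line whose Elmore delay tends to half the lumped value.

```python
def elmore_delay(r_total: float, c_total: float, segments: int) -> float:
    """Elmore delay at the far end of a wire split into equal RC segments."""
    r = r_total / segments
    c = c_total / segments
    # The k-th capacitor is charged through k series resistors.
    return sum(k * r * c for k in range(1, segments + 1))


# Hypothetical wire: 100 ohms and 2 pF in total.
lumped = elmore_delay(100.0, 2e-12, segments=1)
distributed = elmore_delay(100.0, 2e-12, segments=1000)
ratio = distributed / lumped
```

Since both the total resistance and the total capacitance of a wire grow with its length, either estimate grows quadratically with wire length, which is why long nets matter so much once wire resistance is no longer negligible.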
Currently, there are growing demands for low power circuits for two main reasons.
First, as devices continue to shrink and chip density continues to increase rapidly, with feature
sizes scaled down to 0.6 µm at present (and 0.2 µm expected) and clock frequencies over 100 MHz, it becomes
too expensive to provide adequate cooling systems for powerful microprocessor chips.
For example, using 0.75 µm technology and a 3.3 V power supply, DEC’s Alpha chip
consumes 30 W at 200 MHz [73]. Second, with the increasing popularity of portable
consumer products (e.g., laptop/notebook computers and cellular phones), low-power
designs become a must, because conventional nickel-cadmium battery technology provides
only 20 W·h of energy for each pound of weight [74]. For these reasons, designers now
are willing to trade off area for low power consumption.

There has been active research related to low-power designs [75-82]. At the architecture
level, a parallel implementation can be used to maintain throughput while reducing
the supply voltage, thus reducing the power consumption [75]. At the circuit level, a
popular technique is to turn off the system clock for those parts of the circuit that are
not active. In a CMOS design, the average power consumed by a gate is given by

$P_{avg} = \frac{1}{2} \times C_{out} \times V_{dd}^{2} \times D$  (6.1)

where $C_{out}$ is the output load capacitance, $V_{dd}$ is the power supply voltage, and $D$ is the
transition density of the gate [79]. Hence, the power consumption of a gate is determined by
three factors, namely, $C_{out}$, $V_{dd}$, and $D$. At the device level, work is being done to reduce
the peak voltage needed for switching (reducing $V_{dd}$). Other than $V_{dd}$, we can reduce the
power consumed by a single gate by reducing $C_{out}$ and $D$. Some work has been done in
the area of low-power logic synthesis. However, as with area and delay optimization
in logic synthesis, the work is applied to technology-independent synthesis, where gate
models for power, delay, and area are not very accurate [76]. Therefore, it is important
to perform low-power design at the gate-sizing and physical layout stages.
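
As a rough illustration of Equation (6.1), the following Python sketch evaluates the average power of a single gate. The function name and the operating point (50 fF load, 3.3 V supply, 20 million transitions per second) are illustrative assumptions, not data from this work.

```python
def average_gate_power(c_out, v_dd, transition_density):
    """Average power of a CMOS gate per Eq. (6.1): 0.5 * C_out * V_dd^2 * D.

    c_out: output load capacitance in farads
    v_dd: supply voltage in volts
    transition_density: average number of output transitions per second
    """
    return 0.5 * c_out * v_dd ** 2 * transition_density


# Hypothetical operating point, chosen only for illustration.
p_full = average_gate_power(50e-15, 3.3, 20e6)

# Halving the load capacitance halves the power;
# halving the supply voltage cuts it to one quarter.
p_half_cap = average_gate_power(25e-15, 3.3, 20e6)
p_half_vdd = average_gate_power(50e-15, 1.65, 20e6)
```

The linear dependence on $C_{out}$ and $D$ is what gate sizing and placement can exploit, while the quadratic dependence on $V_{dd}$ is addressed at the device and architecture levels.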

For the gate-sizing and physical layout problems, it is important to reduce the
capacitance load of those gates with large switching activity. For example, if gate i has large
switching activity, it is then desirable to reduce its capacitance load due to (1) its fan-out
gates, and (2) interconnect wires. For case (1), we should choose a template with smaller
input capacitance for its fan-out gates. Hence, the objective function of the optimization
should be weighted by the transition density of each gate. To deal with case (2), during
the placement procedure, it is advantageous to place the fan-out gates of i closer to it.
Similarly, a weight based on each gate's transition density can be included when calculating
the total wire length. Therefore, a low-power driven placement algorithm can also be developed.
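
One possible form of such a weighted wire-length measure is sketched below in Python. The half-perimeter bounding-box estimate, the data layout, and all names are assumptions made for illustration; the text does not prescribe a particular wire-length model.

```python
def weighted_wire_length(nets, positions, density):
    """Sum of per-net wire-length estimates, each scaled by the transition
    density of the net's driving gate.

    nets: dict mapping a driver gate to the list of its fan-out gates
    positions: dict mapping a gate to its (x, y) placement
    density: dict mapping a gate to its transition density
    """
    total = 0.0
    for driver, fanouts in nets.items():
        pins = [positions[driver]] + [positions[g] for g in fanouts]
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        # Half-perimeter of the bounding box (an assumed length estimate).
        half_perimeter = (max(xs) - min(xs)) + (max(ys) - min(ys))
        total += density[driver] * half_perimeter
    return total


# Hypothetical three-gate placement.
positions = {"g1": (0, 0), "g2": (3, 4), "g3": (1, 1)}
nets = {"g1": ["g2"], "g3": ["g2"]}
density = {"g1": 2.0, "g2": 1.0, "g3": 0.5}
cost = weighted_wire_length(nets, positions, density)
```

With this weighting, shortening the net driven by the high-activity gate `g1` lowers the cost more than shortening the net driven by `g3` by the same amount.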

APPENDIX A

EXTRACTING  PARAMETERS 
FROM  A  LIBRARY 


In  this  appendix,  we  will  show  how  various  parameters  needed  in  our  formulation  can 
be  calculated  from  a  given  standard-cell  library. 

As an example, consider a three-input AOI (and-or-inverter) gate whose output logic
value is $\overline{a_1 \cdot a_2 + b}$. This gate has three inputs, namely, $a_1$, $a_2$, and $b$. In the given
library, this gate has three templates with different cell areas and driving capabilities.
Usually, the library lists pin-to-pin delay information, as well as worst-case delay. In
our application, we use the worst-case delay. Suppose the characteristics of the three
templates are specified as follows.

• Template 1

- cell area = 1856
- worst-case delay = $R_L \times C_{out} + \tau = 3.64 \times C_{out} + 0.75$
- pin capacitance of $a_1$ = 0.123
- pin capacitance of $a_2$ = 0.091
- pin capacitance of $b$ = 0.111

• Template 2

- cell area = 3401
- worst-case delay = $2.05 \times C_{out} + 1.40$
- pin capacitance of $a_1$ = 0.185
- pin capacitance of $a_2$ = 0.175
- pin capacitance of $b$ = 0.201

• Template 3

- cell area = 5797
- worst-case delay = $1.00 \times C_{out} + 2.30$
- pin capacitance of $a_1$ = 0.405
- pin capacitance of $a_2$ = 0.285
- pin capacitance of $b$ = 0.345

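Let $R_m = 5.0$. From the above information, we have

• $w_1 = R_m / R_L = 5.0/3.64 = 1.374$

• $w_2 = R_m / R_L = 5.0/2.05 = 2.439$

• $w_3 = R_m / R_L = 5.0/1.00 = 5.000$

where $R_L$ is the coefficient of $C_{out}$ in each template's worst-case delay, and $w_1$, $w_2$, and $w_3$ are the permissible gate sizes.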
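
The gate-size calculation above can be checked with a few lines of Python. The variable names are illustrative; the numbers are the delay coefficients of the three templates listed earlier.

```python
R_M = 5.0  # reference value R_m chosen in the text

# Coefficient of C_out in each template's worst-case delay.
delay_slopes = {"template 1": 3.64, "template 2": 2.05, "template 3": 1.00}

# Gate size w = R_m / R_L for each template.
gate_sizes = {name: R_M / slope for name, slope in delay_slopes.items()}

for name, w in gate_sizes.items():
    print(f"{name}: w = {w:.3f}")
```

A template with a smaller delay coefficient (stronger drive) therefore maps to a larger gate size.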

To obtain a linear relationship between the cell area and the gate size, we linearly
approximate the set of data points of area vs. $w$, {(1.374, 1856), (2.439, 3401), (5.000, 5797)}.
Then we have the following linear expression of the gate area in terms of the gate size,
$w$.

$area = \gamma \cdot w + \epsilon = 1060.02 \cdot w + 571.346$  (A.1)

Therefore, $\gamma = 1060.02$ and $\epsilon = 571.346$. The data points and the affine function are
plotted in Figure A.1.
Figure A.1 Calculating $\gamma$ and $\epsilon$.
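
To get a feel for how closely the affine model of Equation (A.1) follows the library, the Python sketch below evaluates it at the three permissible gate sizes and compares the result with the tabulated cell areas. The function and variable names are illustrative.

```python
GAMMA, EPSILON = 1060.02, 571.346  # coefficients from Eq. (A.1)


def modeled_area(w):
    """Cell area predicted by the affine model for gate size w."""
    return GAMMA * w + EPSILON


# (gate size, cell area) pairs taken from the three templates.
library = [(1.374, 1856), (2.439, 3401), (5.000, 5797)]

relative_errors = [(modeled_area(w) - area) / area for w, area in library]
for (w, area), err in zip(library, relative_errors):
    print(f"w = {w:.3f}: library {area}, model {modeled_area(w):.1f}, "
          f"error {100 * err:+.1f}%")
```

For these numbers the model stays within 10 percent of each tabulated area, which is the price paid for a continuous, linear area function.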


The capacitance at input pin $a_1$ to be used to calculate the loading capacitance, $C_{out}$,
of this gate's fan-in gates can be obtained by linearly approximating the set of data points
of $a_1$ pin capacitance vs. $w$, {(1.374, 0.1229), (2.439, 0.185), (5.000, 0.405)}. This gives

$cap(a_1) = \alpha_{a_1} \cdot w + \beta_{a_1} = 0.05885 \cdot w + 0.03007$  (A.2)

Therefore, $\alpha_{a_1} = 0.05885$ and $\beta_{a_1} = 0.03007$. The data points and the affine function
are plotted in Figure A.2. The capacitances at pins $a_1$, $a_2$, and $b$ contribute to the output
capacitance of any fan-in gates.

Following the same procedure, we can find that $\alpha_{a_2} = 0.06942$, $\beta_{a_2} = -0.00033$, and
$\alpha_b = 0.06371$, $\beta_b = 0.03336$.

To obtain the values of $\tau_1$ and $\tau_2$, we have to linearly approximate the following data
set of $\tau$ vs. $w$, {(1.374, 0.75), (2.439, 1.40), (5.00, 2.30)}. This gives

$\tau = \tau_1 \cdot w + \tau_2 = 0.41349 \cdot w + 0.268641$  (A.3)

Therefore, $\tau_1 = 0.41349$ and $\tau_2 = 0.268641$. The data points and the affine function are
plotted in Figure A.3.
Figure A.3 Calculating $\tau_1$ and $\tau_2$.
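
The text does not name the fitting method used for the linear approximation. Assuming an ordinary least-squares fit, the coefficients of Equation (A.3) can be recomputed with the short Python sketch below; the helper `fit_line` is a hypothetical name introduced here.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# (gate size, intrinsic delay) pairs from the three templates.
tau_points = [(1.374, 0.75), (2.439, 1.40), (5.00, 2.30)]
tau_1, tau_2 = fit_line(tau_points)
print(f"tau_1 = {tau_1:.5f}, tau_2 = {tau_2:.6f}")
```

Under this assumption the fit to the $\tau$ data agrees closely with the coefficients quoted in Equation (A.3). The same helper could be applied to the area and pin-capacitance data, although the result may differ from the quoted coefficients if a different fitting procedure or more precise gate sizes were used.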

Finally, the gate delay can be approximated by the following equation:

$delay \approx \frac{R_m}{w} \times C_{out} + \tau_1 \cdot w + \tau_2 = \frac{5.0}{w} \times C_{out} + 0.41349 \cdot w + 0.268641$  (A.4)
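
As a sanity check on the delay model of Equation (A.4), the Python sketch below evaluates it at the three permissible gate sizes and compares the result with the worst-case delays listed for the templates. The load value `c_out = 1.0` is a hypothetical choice in the library's capacitance units, and the names are illustrative.

```python
R_M = 5.0
TAU_1, TAU_2 = 0.41349, 0.268641  # coefficients from Eq. (A.3)


def modeled_delay(w, c_out):
    """Gate delay per Eq. (A.4) for gate size w and load c_out."""
    return R_M / w * c_out + TAU_1 * w + TAU_2


# (delay coefficient, intrinsic delay) for the three library templates.
templates = [(3.64, 0.75), (2.05, 1.40), (1.00, 2.30)]
c_out = 1.0  # hypothetical load

differences = []
for slope, intrinsic in templates:
    w = R_M / slope                      # gate size of this template
    library_delay = slope * c_out + intrinsic
    differences.append(modeled_delay(w, c_out) - library_delay)
```

Because $R_m / w$ reproduces each template's delay coefficient exactly, the only discrepancy comes from replacing the tabulated intrinsic delay with the affine approximation $\tau_1 \cdot w + \tau_2$.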